ROSANA-ML Archive - Dr. Roland Stuckardt

ROSANA-ML

ROSANA-ML is a system for resolving anaphors in natural language text based on machine-learned decision trees. The acronym ROSANA-ML stands for robust syntax-based anaphor interpretation employing machine-learned decision trees. At present, the system focuses on the resolution of third-person non-possessive and possessive pronouns. The implementation and evaluation of ROSANA-ML investigate what may be gained by employing machine-learned preference strategies as part of a robust anaphor resolution approach according to the Lappin & Leass (1994) paradigm in which the antecedent filtering strategies are manually designed. The manually crafted algorithm ROSANA is taken as the starting point. Empirical …

Methodology

The figure displayed below outlines the machine learning approach to anaphor resolution followed by ROSANA-ML. A distinction is made between the training phase, shown in the upper part of the figure, and the application (anaphor resolution) phase, sketched in the lower part.

Training Phase

During the training phase, a set of feature vectors is generated from a training text corpus. This set consists of feature tuples derived from the (anaphor, antecedent candidate) pairs that are considered during the antecedent selection phase of the anaphor resolution algorithm ROSANA. This output is written to …

Algorithms

Training Data Generation

Step 1, antecedent filtering, in which different kinds of restrictions for eliminating impossible antecedents (in particular, agreement in person/number/gender and syntactic disjoint reference) are applied, is taken over unchanged from the original ROSANA algorithm. In step 2, however, no salience ranking of the remaining antecedent candidates is performed. Rather, each remaining anaphor-candidate pair (A,C) is mapped to a feature vector fv(A,C), the attributes f1,…,fk of which comprise individual and relational features derived from the descriptions of the occurrences A and C. The signature of the feature vectors, i.e. the inventory of features to be taken …
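The two steps described above can be illustrated with a minimal Python sketch. It is not the ROSANA-ML implementation (which is written in Common Lisp): the `Occurrence` record, its fields, and the three features in `feature_vector` are hypothetical, and step 1 is reduced to the agreement check alone, leaving out syntactic disjoint reference.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Occurrence:
    """Hypothetical, simplified description of an anaphor or candidate occurrence."""
    text: str
    person: int
    number: str    # "sg" or "pl"
    gender: str    # "m", "f" or "n"
    sentence: int  # index of the containing sentence
    role: str      # syntactic role, e.g. "subj" or "obj"


def agrees(a: Occurrence, c: Occurrence) -> bool:
    """Agreement in person, number and gender."""
    return (a.person, a.number, a.gender) == (c.person, c.number, c.gender)


def filter_candidates(anaphor: Occurrence,
                      candidates: List[Occurrence]) -> List[Occurrence]:
    """Step 1 (simplified): drop candidates that violate agreement.
    Syntactic disjoint reference is NOT modelled in this sketch."""
    return [c for c in candidates if c is not anaphor and agrees(anaphor, c)]


def feature_vector(a: Occurrence, c: Occurrence) -> Tuple[str, str, int]:
    """Step 2: map the pair (A, C) to fv(A, C).
    The three attributes are illustrative assumptions, not the real signature."""
    return (a.role, c.role, a.sentence - c.sentence)


def training_pairs(anaphor: Occurrence, candidates: List[Occurrence]):
    """Filter first, then build one feature vector per remaining pair."""
    return [(c, feature_vector(anaphor, c))
            for c in filter_candidates(anaphor, candidates)]
```

In this sketch no ranking takes place: every candidate that survives the filter yields one feature vector, which mirrors the statement that step 2 replaces salience ranking with a mapping to fv(A,C).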

Languages

Because of the currently employed syntactic analysis frontend (Timo Järvinen's and Pasi Tapanainen's FDG (Functional Dependency Grammar) Parser for English), the current version of ROSANA-ML processes texts in English. The core algorithm of ROSANA-ML, however, is applicable to a wide class of languages.

Experiments

Full details of the experimental variations are given in the book article [PDF] published in 2005. The most important stage of experimental variation concerns the determination of the empirically optimal set of attributes over which the decision trees are learned (i.e., the signature of the feature vectors). The considered attributes (synonymously referred to as features) are mainly based on syntactic, morphological, and surface information, thus meeting the requirements of knowledge poorness and robustness. A distinction is made between non-relational features (pertaining to individual anaphoric or candidate occurrences O) and relational features (pertaining to pairs of anaphors and antecedent candidates (A,C)). …
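The distinction between non-relational and relational features, and the idea of varying the signature between experiments, can be sketched as follows. This is an illustration under assumptions only: the feature names, the `ana_`/`cand_` prefix convention, and the dictionary representation of occurrences are hypothetical and are not taken from ROSANA-ML.

```python
from typing import Callable, Dict, Sequence, Tuple

Occurrence = Dict[str, object]  # hypothetical representation of an occurrence O

# Non-relational features: functions of a single occurrence O.
NON_RELATIONAL: Dict[str, Callable[[Occurrence], object]] = {
    "type": lambda o: o["type"],
    "role": lambda o: o["role"],
}

# Relational features: functions of an (anaphor, candidate) pair (A, C).
RELATIONAL: Dict[str, Callable[[Occurrence, Occurrence], object]] = {
    "sentence_distance": lambda a, c: a["sentence"] - c["sentence"],
    "same_role": lambda a, c: a["role"] == c["role"],
}


def build_vector(signature: Sequence[str],
                 a: Occurrence, c: Occurrence) -> Tuple[object, ...]:
    """Build fv(A, C) for a given signature (an ordered list of feature names).

    Names prefixed with "ana_" or "cand_" select a non-relational feature of
    the anaphor or the candidate; all other names select a relational feature.
    """
    values = []
    for name in signature:
        if name.startswith("ana_"):
            values.append(NON_RELATIONAL[name[len("ana_"):]](a))
        elif name.startswith("cand_"):
            values.append(NON_RELATIONAL[name[len("cand_"):]](c))
        else:
            values.append(RELATIONAL[name](a, c))
    return tuple(values)
```

Keeping the signature as a plain list of names makes it cheap to rerun training with a different attribute set, which is the kind of experimental variation described above.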

Results of Corpus-Based Evaluation

Text Corpus

ROSANA-ML has been trained and evaluated on the same corpus that was employed for evaluating ROSANA, viz. a corpus of 66 news agency press releases, comprising 24,712 words, 1,093 sentences (on average 22.61 words/sentence), 406 third-person non-possessive pronouns (PER3), and 246 third-person possessive pronouns (POS3). For cross-validation, random partitions of this corpus have been generated (see the details in the book article [PDF] published in 2005). In all experiments, the training data generation and the application of the trained system take place under conditions of potentially noisy data, i.e. without a-priori intellectual correction …
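How a random partition of a 66-document corpus might be generated for cross-validation can be sketched in Python. The number of folds, the seed, and the function names are assumptions made for illustration; the actual partitioning scheme is described in the 2005 book article, not here.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def random_partition(doc_ids: Iterable[int], k: int, seed: int) -> List[List[int]]:
    """Shuffle the document ids and deal them into k disjoint folds."""
    rng = random.Random(seed)  # local generator, so the partition is reproducible
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def train_test_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training documents, held-out documents), one pair per fold."""
    for i, held_out in enumerate(folds):
        training = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield training, held_out


# Example: 66 documents split into 6 folds (k=6 is an assumed value).
folds = random_partition(range(66), k=6, seed=0)
```

Partitioning by document rather than by pronoun keeps all anaphors of one press release in the same fold, so that no text contributes to both training and evaluation within a single run.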

Implementation

The core system of ROSANA-ML has been implemented in Common Lisp and has been run under Allegro Common Lisp, Version 4.3 for Linux as well as under Xanalys Lispworks 4.2.0 for Linux. It is made up of 12,461 lines of LISP source code (including a basic graphical user interface for Xanalys Lispworks). The architecture of the system is modular, i.e. it supports the adaptation to different syntactic analysis frontends. A separate software module, comprising 2,846 lines of LISP code, implements the formal measures for the corpus-based evaluation. At present, the core system of ROSANA-ML is neither technically coupled with the …

Distribution and Documentation

A non-commercial, non-profit research license of ROSANA-ML is available upon request. Details of the licensing conditions are given in the License Agreement for ROSANA-ML and ROSANA [PDF]. As a prerequisite for obtaining access to the ROSANA-ML distribution, two hardcopies of this form have to be completed, signed, and mailed in. Note that the current distribution of ROSANA-ML is rather experimental. A first impression can be gained by studying the document Getting Started with ROSANA-ML [PDF], which provides details of how to run and test ROSANA-ML on the corpus included in the distribution.

Background Information

A detailed description of ROSANA-ML and the underlying methodology is provided in the article Roland Stuckardt. A Machine Learning Approach to Preference Strategies for Anaphor Resolution. In: António Branco, Tony McEnery, Ruslan Mitkov (Eds.), Anaphora Processing: Linguistic, Cognitive, and Computational Modelling. John Benjamins, January 2005. [PDF]