Results of Corpus-Based Evaluation

Text Corpus

The training and evaluation of the ROSANA-ML system have been performed on the same corpus that was employed for evaluating ROSANA, viz. a corpus of 66 news agency press releases comprising 24,712 words, 1093 sentences (on average 22.61 words per sentence), 406 third-person non-possessive pronouns (PER3), and 246 third-person possessive pronouns (POS3). For cross-validation, random partitions of this corpus have been generated (see the details in the book article [PDF] published in 2005). In all experiments, the training data generation and the application of the trained system take place under conditions of potentially noisy data, i.e. without a priori manual correction of orthographic or syntactic errors.

Disciplines of Formal Evaluation

As in the case of ROSANA, the anaphor resolution performance has been evaluated with respect to two evaluation disciplines: immediate antecedency (ia) and non-pronominal anchors (na). In the former discipline, the accuracy of the immediate antecedent choices is measured; in the latter, the accuracy of the (application-relevant) determination of non-pronominal antecedents is evaluated. By further distinguishing between precision and recall (thus appropriately accounting for cases in which not every anaphor is assigned an antecedent), the respective tradeoffs (Pia,Ria) and (Pna,Rna) are obtained.
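For illustration, the following minimal Python sketch shows how such a (P,R) tradeoff can be computed from per-anaphor outcomes; the AnaphorResult structure and function names are hypothetical and not part of ROSANA-ML.

    from dataclasses import dataclass

    @dataclass
    class AnaphorResult:
        """Outcome of resolving one anaphor (illustrative structure)."""
        resolved: bool            # was an antecedent assigned at all?
        ia_correct: bool = False  # immediate antecedent choice correct?
        na_correct: bool = False  # non-pronominal anchor determined correctly?

    def precision_recall(results, correct_attr):
        """(P, R) for one discipline: P = correct/resolved, R = correct/all.
        The two values coincide whenever every anaphor is resolved."""
        total = len(results)
        resolved = sum(1 for r in results if r.resolved)
        correct = sum(1 for r in results
                      if r.resolved and getattr(r, correct_attr))
        return (correct / resolved if resolved else 0.0,
                correct / total if total else 0.0)

    # e.g. (Pia, Ria) = precision_recall(per3_results, "ia_correct")
    #      (Pna, Rna) = precision_recall(per3_results, "na_correct")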

Quality of Anaphor Interpretation Results

According to the averaged results of a 6-fold cross-validation, based on a random partition of the document set into six subsets of eleven documents each, ROSANA-ML performs as follows:

    • third person nonpossessive pronouns (PER3):
      (Pia,Ria) = (0.66,0.66)
      (Pna,Rna) = (0.62,0.62)

    • third person possessive pronouns (POS3):
      (Pia,Ria) = (0.75,0.75)
      (Pna,Rna) = (0.68,0.68)

Since, in both cases, all pronouns are assigned an antecedent, precision equals recall; these figures therefore express accuracy.
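The cross-validation setup itself is straightforward; the following Python sketch of a random 6-fold document partition is a hypothetical illustration, not the original experiment code.

    import random

    def six_fold_partition(documents, folds=6, seed=0):
        """Randomly split the document set into equally sized folds
        (here: 66 documents -> six subsets of eleven documents each)."""
        docs = list(documents)
        random.Random(seed).shuffle(docs)
        size = len(docs) // folds
        return [docs[i * size:(i + 1) * size] for i in range(folds)]

    # For each fold i, the preference classifier is trained on the other
    # five folds and evaluated on fold i; the reported figures are averages.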

Discussion and Comparison

The results of the evaluation runs show that, by employing a machine learning approach to preference strategies for anaphor resolution, results can be reached that are at least comparable to those of the best manually tuned systems. With respect to the current best settings, the resolution quality for possessive third-person pronouns is slightly higher than that of the ancestor system ROSANA, whereas for non-possessives it slightly lags behind.

Biasing ROSANA-ML towards high precision: ROSANA-ML-theta

In a further series of experiments in which ROSANA-ML was biased towards precision (thus leaving subsets of the PER3 and POS3 pronouns unresolved), evidence has been gained that better (P,R) tradeoffs can be obtained than those achieved by the approach of Aone & Bennett (1995). For example, with a particular setting of the threshold theta employed for precision biasing, ROSANA-ML-theta performed as follows:

    • third person nonpossessive pronouns (PER3):
      (Pia,Ria) = (0.81,0.45)
      (Pna,Rna) = (0.74,0.36)

    • third person possessive pronouns (POS3):
      (Pia,Ria) = (0.89,0.50)
      (Pna,Rna) = (0.67,0.30)

While this can be interpreted as an indication that the approach employed by ROSANA-ML, which focuses on machine-learned preference strategies, may lead to better overall performance, it has to be kept in mind that the cases of English third-person pronouns and Japanese zero pronouns are not immediately comparable. A more detailed discussion of the results can be found in a book article [PDF] published in 2005.
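Conceptually, the precision biasing amounts to discarding low-confidence antecedent decisions. Here is a minimal sketch, assuming a learned scoring function for (anaphor, candidate) pairs; the names are illustrative, not ROSANA-ML internals.

    def resolve_with_threshold(anaphor, candidates, score, theta):
        """Pick the best-scoring candidate antecedent, but leave the
        anaphor unresolved if even the best score falls below theta."""
        if not candidates:
            return None
        best = max(candidates, key=lambda c: score(anaphor, c))
        return best if score(anaphor, best) >= theta else None

Raising theta trades recall for precision, which yields tradeoffs such as the (0.81,0.45) figures reported above.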
