In the formal evaluation of ROSANA, three interpretation disciplines are distinguished:
- OV: identification of occurrences ("referring" linguistic expressions);
- KV: coreference class determination (similar to the MUC CO task);
- PS: determination of nonpronominal lexical substitutes for pronouns, i.e. common nouns or names that belong to the same coreference class as the pronoun under consideration.
In the following sections, the underlying formal evaluation measures and the respective results of ROSANA are discussed in more detail.
The OV Task: Identifying Occurrences of Discourse Referents
The identification of linguistic expressions that specify the semantic entities the discourse is about constitutes the basic requirement for the resolution of anaphoric expressions. The evaluation focuses on object anaphora. Other types, e.g. event anaphora, are not taken into consideration because their algorithmic treatment as well as the definition of an appropriate evaluation scenario poses further difficulties that are regarded as beyond the scope of current research. Due to this restriction, the problem largely coincides with the identification of the noun phrases of the text to be interpreted. There are, however, several exceptions that render the problem more difficult, e.g. essentially non-referring nominal expressions such as expletive occurrences of the pronoun it.
In the evaluation, the occurrences specified in the key are aligned with the occurrences identified by ROSANA. Precision errors are occurrences in the system output that are not specified in the key. Analogously, recall errors are contributed by key occurrences without counterpart in the ROSANA results. By further taking into account the occurrences in the intersection of system response and key, one obtains the formal measures of precision (P) and recall (R) in the usual way. In the case of ROSANA, the following results have been determined:
EVALUATION RESULT: OCCURRENCES
- ANA ONLY: 243
- KEY ONLY: 150
- ANA AND KEY: 3831
=> PRECISION: 0.9404
=> RECALL: 0.9623
i.e. P = 3831/(3831+243) = 0.9404, R = 3831/(3831+150) = 0.9623.
This means that ROSANA identifies referential expressions with high precision and an even better recall.
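The alignment-based computation of the OV measures can be sketched as follows; representing occurrences as an abstract set of identifiers is an illustrative assumption, since the text does not describe ROSANA's actual alignment procedure.

```python
# Sketch of the OV evaluation: align system occurrences with key occurrences
# (idealized here as sets of occurrence ids) and compute precision/recall.

def ov_scores(system, key):
    """Precision and recall over aligned occurrence sets."""
    both = len(system & key)          # occurrences in both system response and key
    precision = both / len(system)    # precision errors: system-only occurrences
    recall = both / len(key)          # recall errors: key-only occurrences
    return precision, recall

# Reproducing the reported figures: 3831 shared, 243 system-only, 150 key-only.
system = set(range(3831 + 243))                    # 4074 system occurrences
key = set(range(3831)) | set(range(5000, 5150))    # 3831 shared + 150 key-only
p, r = ov_scores(system, key)
print(round(p, 4), round(r, 4))  # 0.9404 0.9623
```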
The KV Task: Identifying Equivalence Classes of Cospecifying Occurrences
For several reasons, the mere consideration of individual instances of sequential resumptions of antecedent expressions by anaphoric expressions has to be regarded as inadequate. In particular, this is due to the fact that, in general, the correct antecedent is not uniquely defined. Instead, there exist convincing arguments that the problem of anaphor resolution (at least as long as it is confined to instances of identity of reference) should be considered as the task of identifying equivalence classes of referring expressions. Following an approach developed by Vilain, Burger, Aberdeen, Connolly and Hirschman for the MUC-6 Coreference Task evaluation, the definitions of the precision and recall measures for the KV task are based on an alignment of the key equivalence classes with the equivalence classes determined by ROSANA. Precision errors correspond to cuts that are induced on the response classes by the key classes. Conversely, recall errors correspond to cuts that are induced on the key classes by the response classes. Based on these definitions, the following results have been determined for ROSANA:
ANAPHOR RESOLUTION RESULT PARTITION
- CUTS: 256
- POSSIBLE: 1334
=> PRECISION: 0.8081
KEY PARTITION
- CUTS: 496
- POSSIBLE: 1572
=> RECALL: 0.6845
AVERAGE PER DOCUMENT
=> PRECISION: 0.8191
=> RECALL: 0.6659
i.e. P = (1334-256)/1334 = 0.8081, R = (1572-496)/1572 = 0.6845.
By restricting the alignment of equivalence classes to occurrences that are contained in the intersection of key and system response, the KV task evaluation is conceptually separated from the OV task evaluation. This implies, on the one hand, that the results are not directly comparable with the figures determined during the MUC CO Task evaluations. On the other hand, the evaluation results indicate more clearly the distribution of interpretation errors over two different stages of processing.
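The cut-based scoring scheme of Vilain et al. can be sketched as follows. Classes are represented as plain sets of occurrence ids; since the evaluation restricts attention to the intersection of key and response, every element is assumed to appear in both partitions.

```python
# Sketch of the Vilain et al. cut-based scoring used for the KV task.

def cuts_and_possible(classes, other_partition):
    """For each class, count how many pieces the other partition cuts it into."""
    cuts = possible = 0
    for c in classes:
        # group the elements of c by the class they belong to in the other partition
        pieces = {}
        for x in c:
            owner = next(i for i, o in enumerate(other_partition) if x in o)
            pieces.setdefault(owner, set()).add(x)
        cuts += len(pieces) - 1    # one cut per extra piece
        possible += len(c) - 1     # a class of size n admits at most n-1 cuts
    return cuts, possible

def kv_scores(response, key):
    p_cuts, p_poss = cuts_and_possible(response, key)   # key cuts response classes
    r_cuts, r_poss = cuts_and_possible(key, response)   # response cuts key classes
    return (p_poss - p_cuts) / p_poss, (r_poss - r_cuts) / r_poss

# Toy example: the key groups {1,2,3} and {4,5}; the system merged everything.
key = [{1, 2, 3}, {4, 5}]
response = [{1, 2, 3, 4, 5}]
p, r = kv_scores(response, key)
print(p, r)  # 0.75 1.0
```

In the toy example the single response class is cut once by the key (precision 3/4), while no key class is cut by the response (recall 1), mirroring the "cuts/possible" figures reported above.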
In a further stage of evaluation, the performance of ROSANA for the different types of anaphoric expressions has been singled out. These results shed light on the typical quality of the individual resolution decisions:
ATTACHMENT DECISIONS
| | PRECIS | ++ | +- | +? | +_ | +* | ?+ | ?_ |
-----+------+--------+------+------+------+------+------+------+------+
PRON | PE-3 | 0.7143 | 145 | 48 | 10 | 1 | 0 | 18 | 0 |
| PE12 | 0.9474 | 18 | 1 | 0 | 7 | 6 | 0 | 0 |
| PO-3 | 0.7634 | 100 | 28 | 3 | 0 | 0 | 0 | 0 |
| PO12 | 1.0000 | 3 | 0 | 0 | 1 | 1 | 0 | 0 |
| REFL | 1.0000 | 3 | 0 | 0 | 1 | 0 | 0 | 0 |
| RELA | 0.7789 | 74 | 18 | 3 | 6 | 0 | 7 | 4 |
+------+--------+------+------+------+------+------+------+------+
| 0.7555 | 343 | 95 | 16 | 16 | 7 | 25 | 4 |
-----+------+--------+------+------+------+------+------+------+------+
NOMN | VNOM | 0.7014 | 357 | 136 | 16 | 1973 | 43 | 31 | 133 |
| NAME | 0.9390 | 308 | 15 | 5 | 368 | 5 | 5 | 28 |
+------+--------+------+------+------+------+------+------+------+
| 0.7945 | 665 | 151 | 21 | 2341 | 48 | 36 | 161 |
-----+------+--------+------+------+------+------+------+------+------+
AVERAGE PER ANAPHOR
=> PRECISION: 0.7808
The rows correspond to the different types of anaphoric occurrences that are handled by ROSANA: PE-3 = third person personal pronouns, PE12 = personal pronouns in first/second person, PO-3/PO12 = possessive pronouns, REFL = reflexive pronouns, RELA = relative pronouns, NAME = names, VNOM = common nonpronominal NPs. Columns correspond to the different classes of evaluation outcomes: ++ = correct decision, +- = wrong decision, +? = suggested antecedent doesn't exist in the key specification (i.e. OV error), +_ = no antecedent suggested, +* = no antecedent suggested and coreference is marked as optional in the key, ?+ = identified anaphoric occurrence has no counterpart in key (i.e. OV error), ?_ = identified anaphoric occurrence has no counterpart in key (i.e. OV error) and no antecedent has been suggested. The values in the PRECISION column have been computed by applying the formula
P = ++ / (++ + +- + +?)
In compliance with an intuitive understanding of precision, this means that only nonempty decisions in which an incorrect antecedent has been determined for a true occurrence, i.e. an occurrence that is specified in the key, are counted as errors.
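Applying this formula to the row counts of the table above reproduces the reported per-type precision values:

```python
# Sketch: the precision formula P = ++ / (++ + +- + +?) applied to
# selected rows of the attachment-decision table.

def decision_precision(pp, pm, pq):
    """pp = ++ (correct), pm = +- (wrong), pq = +? (antecedent not in key)."""
    return pp / (pp + pm + pq)

rows = {
    "PE-3": (145, 48, 10),   # -> 0.7143
    "PE12": (18, 1, 0),      # -> 0.9474
    "PO-3": (100, 28, 3),    # -> 0.7634
    "NAME": (308, 15, 5),    # -> 0.9390
}
for label, counts in rows.items():
    print(label, round(decision_precision(*counts), 4))
```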
The results indicate that ROSANA performs comparably well. The different types of pronouns are resolved with a precision above 70%. The common types of third person pronouns (PE-3, PO-3) are more difficult to process because of a typically higher degree of contextual ambiguity, whose resolution requires robust resources for semantic and pragmatic interpretation that are not yet available.
The PS Task: Identifying Nonpronominal Substitutes for Pronouns
From the point of view of typical applications, the evaluation figures discussed so far may be regarded as not sufficiently expressive. This is because the mere ability to choose arbitrary correct antecedents with high precision does not necessarily generalize to a solution of the problem of identifying nonpronominal antecedents whose stems may be used as lexical substitutes for the pronominal anaphors to be resolved. In fact, it turns out that this problem tends to be slightly more difficult. The main reason is illustrated by the elementary observation that chains of (potentially incorrect) individual antecedent decisions with lengths typically greater than one have to be followed to determine the first nonpronominal occurrence that belongs to the same coreference equivalence class as an anaphoric expression; a single error suffices to render the selected nonpronominal antecedent incorrect. In addition, by resorting to focus-theoretic arguments, it might be argued that nonpronominal antecedent candidates are typically less prominent in the discourse. Hence, the choice of a nonpronominal lexical anchor is typically more error-prone than the resumption of a pronominal (in some sense focused) occurrence.
ROSANA employs the strategy of selecting the immediate (sequential) antecedent of a pronoun if it is nonpronominal. If, however, this expression is pronominal too, the nonpronominal representative of the coreference class of the anaphor with the smallest surface distance is chosen.
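The selection strategy just described can be sketched as follows; representing occurrences as (position, text, is_pronoun) triples is an illustrative assumption, not ROSANA's internal data structure.

```python
# Sketch of the PS substitute-selection strategy: take the immediate
# antecedent if it is nonpronominal; otherwise pick the nonpronominal
# member of the coreference class with the smallest surface distance.

def lexical_substitute(pronoun_pos, antecedent, coref_class):
    """antecedent: immediate antecedent occurrence; coref_class: all members."""
    pos, text, is_pron = antecedent
    if not is_pron:
        return text                  # immediate antecedent is already nonpronominal
    nonpron = [(p, t) for p, t, pron in coref_class if not pron]
    if not nonpron:
        return None                  # no nonpronominal representative exists
    # choose the nonpronominal occurrence closest to the pronoun on the surface
    p, t = min(nonpron, key=lambda o: abs(o[0] - pronoun_pos))
    return t

chain = [(0, "Mary", False), (5, "she", True), (9, "her", True)]
print(lexical_substitute(9, chain[1], chain))  # Mary
```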
In the following table, the results of the PS task evaluation are summarized.
LEXICAL SUBSTITUTION
| PRECIS | RECALL | ++ | +- | +? | +_ | +* | ?+ | ?_ |
-----+--------+--------+------+------+------+------+------+------+------+
PE-3 | 0.6766 | 0.6667 | 136 | 54 | 11 | 3 | 0 | 18 | 0 |
PE12 | 0.9091 | 0.3846 | 10 | 1 | 0 | 15 | 6 | 0 | 0 |
PO-3 | 0.6641 | 0.6641 | 87 | 39 | 5 | 0 | 0 | 0 | 0 |
PO12 | 1.0000 | 0.5000 | 2 | 0 | 0 | 2 | 1 | 0 | 0 |
REFL | 1.0000 | 0.7500 | 3 | 0 | 0 | 1 | 0 | 0 | 0 |
RELA | 0.7667 | 0.6832 | 69 | 18 | 3 | 11 | 0 | 7 | 4 |
-----+--------+--------+------+------+------+------+------+------+------+
AVERAGE PER PRONOUN
=> PRECISION: 0.7009
=> RECALL: 0.6532
The rows and columns are labeled as above. (The considerations are restricted to pronominal anaphors.) Precision and recall are defined as
P = ++ / (++ + +- + +?)
R = ++ / (++ + +- + +? + +_)
As expected, compared to the above figures, the interpretation quality worsens slightly. The precision, however, still lies above the 65% mark for each pronoun type, with an average above 70%. The low recall results for first and second person pronouns are due to the difficulty of appropriately interpreting such expressions when they occur in inline quoted speech.
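The PS-task measures can again be reproduced from the table counts, e.g. for the PE-3 row:

```python
# Sketch: PS-task precision and recall from the table counts, using
# P = ++ / (++ + +- + +?) and R = ++ / (++ + +- + +? + +_).

def ps_scores(pp, pm, pq, pu):
    """pp = ++, pm = +-, pq = +?, pu = +_ (no substitute suggested)."""
    denom_p = pp + pm + pq
    return pp / denom_p, pp / (denom_p + pu)

p, r = ps_scores(136, 54, 11, 3)   # PE-3 row of the table
print(round(p, 4), round(r, 4))    # 0.6766 0.6667
```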