1 Introduction
Textual Entailment (TE) is a directional relation between two text fragments, one called Text (T) and the other Hypothesis (H). The relation holds if, typically, a human reading T would infer that H is most likely true. As an example, consider the following text fragments where, according to the definition of entailment, only H1 can be inferred from T:
T: Hubble is a telescope that revolves around Earth
H1: Hubble is a scientific instrument that orbits Earth
H2: Hubble was launched into orbit in 1990
In 2004, Recognizing Textual Entailment (RTE) (Dagan, Glickman and Magnini 2005) was proposed as a generic task to capture such inferences. In its original formulation, given two text fragments, the task consists in establishing whether the meaning of the Hypothesis can be inferred from the Text. Many Natural Language Processing applications can benefit from this task (Dagan et al. 2009). In summarization, for example, a summary should be entailed by the text; in question answering and information retrieval, the answer for a query must be entailed by the supporting snippet of text.
The use of machine learning techniques for RTE is limited by the amount of available labeled data, and researchers have had to consider how the dimensionality of the feature space should vary with respect to the sample size. Intuitively, an imbalance between the number of features used and the number of labeled samples can make classification more difficult. As a consequence, when working with data sets as small as RTE-3 (Giampiccolo et al. 2007), many researchers use heuristic feature selection methods in order to reduce the feature space and obtain good feature combinations. The solutions that have been proposed, however, while effective on such data sets, might not work well with new data sets that consist of many samples, for which it may be convenient to use a greater number of more complex features.
In this paper, we present AdArte (A transformation-driven approach for RTE), a system that combines syntactic transformations and machine learning techniques to establish the entailment relations in T–H pairs. Unlike other existing methods, AdArte is designed to fit well with a new kind of data set that is larger but less complex than the data sets used in the past.
The algorithm has been tested on two data sets: the SICK data set, which was used at SemEval-2014 Task#1, and the EXCITEMENT English data set, which is a new data set composed of email feedback sent in by the customers of a railway company. While both data sets are an order of magnitude larger (in number of annotated samples) than RTE-3, they also present lower levels of complexity, albeit in different ways.
SICK consists of pairs created by applying a predefined set of syntactic and lexical rules to the sentences from which its T–H pairs are built. The rules can preserve the meaning (e.g., turning active sentences into passive), create contradiction (e.g., replacing words with semantic opposites) or create unrelated pairs (e.g., switching and mixing modifiers). To clarify these differences between SICK and RTE, consider the pair T:A man is driving a car, H:The car is being driven by a man extracted from SICK, and the one extracted from RTE: T:US Steel could even have a technical advantage over Nucor since one new method of steel making it is considering, thin strip casting, may produce higher quality steel than the Nucor thin slab technique, H:US Steel may invest in strip casting. It is clear that, unlike the SICK pair, the RTE pair cannot be resolved by syntax alone and requires some degree of text understanding.
The use of such a procedure for the creation of SICK means that there is a correspondence between the rules used to produce a pair (e.g., creating contradictions) and the entailment label assigned to the pair (e.g., contradiction). AdArte learns a set of rules equivalent to those above in order to make an inference judgment. In short, it models the entailment relation as a classification problem where the individual T–H pairs are first represented as a sequence of syntactic operations, called transformations, needed to transform the Text (T) into the Hypothesis (H), and then used as features in a supervised learning classifier that labels the pairs as positive or negative examples. The transformations are calculated by applying tree edit distance on the dependency trees of the T–H pairs, while some knowledge resources, such as WordNet, are used for recognizing cases where T and H use different textual expressions (e.g., home versus house) that still preserve entailment. For classification, we have used Weka (Hall et al. 2009), a collection of machine learning algorithms that enabled us to try different classifiers, like Random Forests and Support Vector Machines (SVM).
Unlike SICK, the EXCITEMENT English data set has not been created by applying any specific kind of rules to its T–H pairs. However, the data set still maintains a low level of complexity due to its short pieces of text, which do not require a high degree of text understanding in order to be annotated.
AdArte is simple to understand and implement but also effective when compared with the systems that took part in SemEval-2014 Task#1 (Marelli et al. 2014a), and the ones distributed as part of the EXCITEMENT Open Platform (EOP) (Padó et al. 2013; Magnini et al. 2014), which is an open source software platform for RTE. AdArte has been made available with the EOP for research use and evaluation.
The paper is organized as follows: Section 2 discusses the related work, while Section 3 describes the data sets used for the system evaluation. Section 4 introduces the transformations and Section 5 goes on to describe the system architecture. Section 6 moves to the experimental part and finally Section 7 presents the conclusions.
2 Related work
TE systems tend to follow the same fundamental approach, which consists in representing the meaning of the Text (T) and the Hypothesis (H) at a certain level and checking if the representation of the meaning of H is contained in that of T. Generally, this is done in one of two ways: (i) purely lexical approaches for which, in the simplest case, the likelihood of entailment is based on lexical overlap between the words of the Text and those of the Hypothesis; (ii) deeper proof-theoretic approaches which use deductive proof mechanisms to perform the entailment judgment. The more successful systems for RTE can be classified into several different categories based on the model they use. In this section, we divide the models into the following categories: alignment-based models, transformation-based models and proof-theoretic models. For a more detailed analysis of the principal techniques that have been developed so far, see Dagan et al. (2013).
The best performing system at the SemEval-2014 Task#1 was the system proposed by the University of Illinois (Lai and Hockenmaier 2014). It combines non-compositional features (e.g., word overlap and alignment between the fragments), compositional features and denotational similarities.
Edit distance models can be viewed as transformation-based models with very simple rules. They were first adopted for RTE in EDITS (Kouylekov and Magnini 2005). In this approach, the distance values calculated on the trees of the T–H pairs in the development set are used to compute a threshold value for separating the entailment pairs from the non-entailment ones. If the distance value between T and H in an unannotated pair is below or equal to the calculated threshold, then there exists entailment between T and H; the relation is non-entailment otherwise. As opposed to this method, we use the transformations produced by tree edit distance as features to create a feature vector for every T–H pair.
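To make the threshold-based decision used in EDITS-style models concrete, the following is a minimal, purely illustrative sketch (not the actual EDITS implementation); it assumes a generic distance function over the dependency trees of T and H and development pairs given as (t_tree, h_tree, is_entailment) triples.

```python
# Illustrative sketch of a threshold-based edit distance model (EDITS-style).
# 'distance' stands for any tree edit distance between the dependency trees
# of T and H; dev_pairs is a list of (t_tree, h_tree, is_entailment) triples.

def learn_threshold(dev_pairs, distance):
    """Pick the distance threshold that best separates entailment
    from non-entailment pairs on the development set."""
    candidates = sorted({distance(t, h) for t, h, _ in dev_pairs})
    best_thr, best_acc = 0.0, -1.0
    for thr in candidates:
        correct = sum((distance(t, h) <= thr) == is_entailment
                      for t, h, is_entailment in dev_pairs)
        acc = correct / len(dev_pairs)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr

def classify(t, h, threshold, distance):
    return 'entailment' if distance(t, h) <= threshold else 'non-entailment'
```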
Heilman and Smith (2010) suggested an approach using transformations which has some superficial similarities to ours; they used tree edit models to represent sequences of tree transformations and then applied the logistic regression classification model by Hastie, Tibshirani and Friedman (2001) on the labels and features of the extracted sequences. Our method differs from their approach in three important aspects: (i) They extended the three basic edit operations with additional and more complex ones, like the operation that involves moving whole subtrees. Next, to efficiently search for a reasonable sequence of transformations, they used greedy best-first search and a computationally expensive Tree Kernel-based heuristic function. In contrast, we use tree edit distance to extract three simple types of transformations (i.e., insertion, deletion, substitution), where the number of produced transformations is the minimum number of transformations required to transform one tree into another. (ii) They selected a well-defined set of 33 features that they considered to be the most promising with the specific data set. Instead, we use all the transformations produced, without any specific manual or semi-automatic feature selection dependent on the data set. In addition, the feature set used is homogeneous: all the features represent a type of transformation. (iii) While they used logistic regression for classification, we used other classifiers that performed better on our data sets and with our feature sets. Our method has the following advantages with respect to theirs: it appears simpler to implement and easier to adapt to new data sets, and is more efficient in terms of computational time. The features can be extracted by using one of the many available implementations of tree edit distance and no feature selection is required; there are no constraints to be set or integrated. On the other hand, we think that their method would outperform ours when tested on data sets like RTE-3. In fact, the high number of features produced by our method, combined with the low number of available labeled samples, would make classification more difficult in this case.
Transformation-based models, which are a step forward toward proof-theoretic models, were suggested by Harmeling (2009) and Stern and Dagan (2012). Harmeling’s proposal was a system for TE based on probabilistic transformation. Supposing that there may be multiple transformation sequences (i.e., derivations) that derive a given Hypothesis from a given Text, the derivation with the highest probability should be selected. An entailment label for such a T–H pair can be determined by selecting a threshold (such as p ⩾ 0.5) above which entailment is considered to hold. Our method differs in three ways from this work: (i) Harmeling defined a fixed set of operations, including syntactic transformations, WordNet-based substitutions, and more heuristic transformations such as adding or removing a verb or even substituting a named entity. In contrast, in our approach, the transformations are extracted automatically and their number and type depend on the syntactic structure of the given pairs. (ii) He applies the transformations in a predefined order, while we execute them in an order that produces the minimum number of transformations. (iii) In Harmeling’s approach, the probability of each predefined transformation is estimated from the development set, and the probability that a transformation sequence is consistent with T (i.e., the entailment holds) is defined as the product of the probability of each transformation in the sequence. Instead, our approach uses a vector space model for representing the T–H pairs as vectors of transformations, and machine learning-based classifiers to assign an entailment label to such vectors.
In Stern and Dagan’s proposal, they apply the transformations derived from knowledge and linguistic resources, such as WordNet and Wikipedia, to preserve the meaning of the text, as follows: when a transformation is applied on a text T, it transforms it into a new text, T′. The goal is that the meaning of T′ will be entailed from T. Then, they estimate the validity of every transformation and, consequently, of every sequence of transformations, and decide whether T entails H based on this estimation. There are two main differences compared to our method: (i) in their method, the transformations are extracted from external resources whereas we extract them from the data set. (ii) They apply the calculated transformations as a sequence of operations in order to determine whether H can be obtained from T while we use the transformations as features for the classifier. We believe our method to be more reliant on the data set for capturing the relevant syntactic phenomena that determine whether an entailment relation holds. Learning from these transformations requires a number of annotated examples which might not be available in some application domains. Their method appears to be able to learn from a smaller number of examples, and might be more suitable for domain adaptation, when the data distribution in the test domain is different from that in the training domain. However, unlike our approach, which allows a relatively easy adaptation to new languages (provided that annotated data is available), the portability of their approach to languages other than English is uncertain, as the same linguistic resources might not be available for many other languages.
More formal proof-theoretic models encode facts in a knowledge base using a formal representation, such as propositional or first-order logic, and apply rules of inference to determine the set of facts that can be derived from the knowledge base. Systems that use theorem-proving as their conceptual basis were suggested by Bos and Markert (2005) and MacCartney (2007).
Finally, it is worth mentioning a recent study by Bowman et al. (2015) using neural networks. Training a neural network on benchmarks such as SICK results in poor performance because such models require even larger amounts of training data. To face this issue, they trained the neural networks in a transferable way: they used the tuned parameters calculated on a new and large data set (i.e., the Stanford Natural Language Inference data set) to initialize the parameters of a model on the target SICK data set. Even though the measured values fall slightly short of the state of the art, they obtained the best performance yet reported for an unaugmented neural network model.
3 Data sets
We used two data sets for the system evaluation: SICK and EXCITEMENT. SICK is the data set used at SemEval-2014 Task#1. Its new release, of which we had a first preview, includes additional information about the syntactic and lexical rules used for creating its T–H pairs. This allowed us to evaluate our method while also considering the different phenomena taking place within the data set (e.g., evaluating the T–H pairs where H has been created by turning T into the passive rather than by replacing its words with semantic opposites). On the other hand, the EXCITEMENT English data set is a new data set containing email feedback from the customers of a railway company. Details of these data sets are given in Sections 3.1 and 3.2.
3.1 SICK data set
The SICK data set (Marelli et al. 2014b) was used at SemEval-2014 Task#1 for the subtasks of: (i) relatedness (i.e., predicting the degree of semantic similarity between two sentences), and (ii) entailment (i.e., detecting the entailment relation between two sentences).
SICK consists of 9,840 English annotated T–H pairs: 4,934 as part of the development set and 4,906 in the test set. Examples of such T–H pairs can be seen in Table 1.
To produce the data set, the annotators randomly chose 750 images from the 8K ImageFlickr data set and then sampled two descriptions from each of them, while another 750 sentence pairs were extracted from the SemEval-2012 STS MSR-Video Descriptions data set. Next, they applied a four-step process consisting of:
• Normalization: from the original sentences (S0), new sentences (S1) were produced by excluding or simplifying instances of lexical, syntactic or semantic phenomena such as named entities, dates, numbers and multi-word expressions, and by normalizing the tense.
• Expansion: each of the normalized sentences (S1) was expanded in order to obtain (i) a sentence with a similar meaning (S2), which means the overall meaning of the sentence does not change, (ii) a sentence with contrasting meaning (S3) and (iii) a sentence that contains most of the same lexical items, but has a different meaning (S4).
• Pairing: S2, S3 and S4 were paired with the S1 sentence from which they were created. Then, S2, S3 and S4 were paired with the other S1 sentences in the original pair. Finally, sentences from different original pairs S0 were randomly paired.
• Annotating: the pairs produced in the previous step were finally annotated with one of three possible gold labels: entailment, contradiction (i.e., the negation of the conclusion is entailed from the premise) and neutral (i.e., the truth of the conclusion cannot be determined on the basis of the premise).
The fact that this procedure has been applied for the creation of the data set means that there is a correspondence between the syntactic and lexical rules applied to produce a pair and the entailment label assigned to the pair. The entailment label is mostly assigned in the case of S1–S2 pairs (similar meaning), the contradiction label in the case of S1–S3 pairs (contrast/contradiction), and the neutral label in the case of S1–S4 pairs (lexical overlap only). However, the data set also contains a relatively high proportion of S1–S3 pairs (and, even if to a lesser extent, of S1–S2 pairs) labeled neutral. Inspection of the neutral pairs reveals a significantly higher incidence of sentences whose subjects contain indefinite articles. For example, T:A couple is not looking at a map does not contradict H:A couple is looking at a map, because one could imagine two different couples, one who looks at a map, while the other one does not. In contrast, T:A person on a bike is not in the air near a body of water contradicts H:A person on a bike is in the air near a body of water, because in this case, the people mentioned in the two sentences were thought to be the same person. As discussed in Section 6.5, this fact impacts the quality of the system’s output. Another aspect that is worth highlighting is that the data set presents some ‘flaws’ that both non-compositional and compositional systems at SemEval-2014 Task#1 took advantage of. In particular, one of them reports that, just by checking the presence of negative words, one can detect 86.4 per cent of the contradiction pairs, and by combining Word Overlap and antonyms one can detect 83.6 per cent of neutral pairs and 82.6 per cent of entailment pairs. Another participant reports that a surprisingly helpful feature was the sentence pair’s ID. In this respect, our approach does not make use of any rule or feature created ad hoc for that purpose. The features used and their number are determined automatically by our method according to the characteristics of the data set. For example, as mentioned before, there is no doubt that the presence of a relation between a negation word and the word it modifies is important in detecting the contradiction pairs; and it is equally true that our method can recognize and exploit such a phenomenon for pair classification. However, this is a general characteristic of our approach, which lies in its ability to learn from the various lexical and syntactic phenomena that arise within the data set. This happens without the need for hand-written rules, which might be designed to work on a specific data set, but might fail when attempting to annotate new and different data sets.
3.2 EXCITEMENT English data set
In order to verify the applicability of our method to a data set other than SICK, and to compare our results with several other existing systems, we used the EXCITEMENT English development data set (Kotlerman et al. 2015), which is freely available from the EOP web site. EXCITEMENT contains pieces of email feedback sent by the customers of a railway company where they state reasons for satisfaction or dissatisfaction with the company. Table 2 shows an example of T–H pairs annotated with one of the two possible entailment relations in the data set, i.e., entailment and non-entailment. Working with this data set differs from annotating the RTE data sets in many respects. As regards preprocessing, capitalization and punctuation marks are often missing in the data set. Moreover, it contains a lot of imperatives (e.g., improve the food) and noun phrases (e.g., Passageway too narrow), which may degrade the performance of methods that rely on deep linguistic analysis, as ours does.
The data distribution was generated manually by its annotators by organizing the emails into groups based on the nineteen different topics they relate to, like food choice and Internet, and then splitting them to generate four different data sets, namely mix-balanced, mix-unbalanced, pure-balanced and pure-unbalanced. For these sets, the distinction between balanced and unbalanced data sets lies in whether the data set contains a comparable number of examples per category or a large number of examples from only one category. The terms mix and pure refer to how the T–H pairs are distributed among the data sets. In the case of the mix data sets, the T–H pairs referring to a specific topic (e.g., food choice) are equally distributed between the training and test data set (i.e., both the training and test sets contain examples of food choice). In contrast, in the pure data sets, the training and the test set contain pairs referring to different topics (e.g., the training set could contain the examples for food choice, whereas the test set contains the ones for Internet).
The reason for creating this split is that it sheds light on how different RTE systems perform under different conditions (e.g., a system might perform better on balanced data sets, while another system could be insensitive to that).
4 Transformations
Transformations are elementary edit operations needed to transform one text into another. The most widely used edit operations are deleting, replacing and inserting pieces of text. In the context of TE, the transformations of a T–H pair are the edit operations needed to transform T into H. For example, given the following T–H pair extracted from SICK:
label: NEUTRAL
T:The girl is spraying the plants with water
H:The boy is spraying the plants with water
the transformation consisting in replacing girl with boy is just one of the possible transformation sequences that can be applied to transform T into H. In fact, other sequences would be equally possible, like the sequence of atomic operations that would remove the whole content of T and add the whole content of H.
4.1 Calculating the transformations
Several approaches to RTE operate directly at the token level, generally after applying some preprocessing, such as part-of-speech (PoS) tagging, but without computing more elaborate syntactic or semantic representations. Another approach, like the one adopted in the present work, is to work at the syntax level by using dependency grammar parsers (Kubler et al. 2009). The output of a dependency grammar parser is a tree whose nodes are the words in the sentence and whose labeled edges correspond to the syntactic dependencies between words. Figure 1 shows the dependency trees for the Text (T) and Hypothesis (H) introduced above.
Each node in the tree consists of the word itself, the lemmatized word with its PoS tag, and the syntactic dependency relation (dprel) with its parent node. We use MaltParser (Nivre, Hall and Nilsson 2006) trained on parts 2–21 of the Wall Street Journal section of the Penn Treebank to produce the dependency trees, while tree edit distance (see Section 5.1) is applied on the resulting trees to obtain the transformations needed to convert one tree into another. The tree edit distance problem has a recursive solution that decomposes the trees into subtrees and subforests. The direct implementation of this recursive solution has exponential complexity. To limit time and space complexity, we adopt the implementation described in the paper by Zhang and Shasha (1989) which uses dynamic programming to achieve polynomial runtime. Regardless of the implementation used, one of the main characteristics of tree edit distance is that it computes the minimum number of transformations among all the possible sequences that would achieve the same result. In this context, three different types of transformation can be defined:
• Inserting: insert a node N from the tree of H into the tree of T. When N is inserted, it is attached with the dprel of the source label.
• Deleting: delete a node N from the tree of T. When N is deleted, all its children are attached to the parent of N. It is not required to explicitly delete the children of N as they are going to be either deleted or substituted in a following step.
• Replacing: change the label of a node N1 in the source tree (the tree of T) into the label of a node N2 of the target tree (the tree of H).
In addition to these, there is a fourth transformation, called Matching, that does not produce any changes on the nodes (i.e., it is simply applied when two nodes are equal), and which generally is not considered in the edit distance calculation. In our approach, however, the transformations are used as features for the T–H pairs classification, and the Matching transformation, which measures the overlap of the nodes, might be considered complementary to the other transformations and useful in estimating the similarity between the dependency trees of the fragments of text. This concept will be further developed in Section 4.3.
In our implementation, the transformations above are defined at the level of single nodes of the dependency tree (we can insert, remove and replace one node at a time while transformations consisting in inserting, removing and replacing whole subtrees are not allowed) and nodes are matched by considering both their lemma and dprel with their parent node, i.e., two nodes are considered equivalent if and only if they have exactly the same lemma and dprel.
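As an illustration of the calculation described above and of this matching criterion, the following sketch uses the third-party zss package, a publicly available implementation of the Zhang and Shasha algorithm, on toy trees for the running example. It is only a sketch under stated assumptions: node labels here concatenate lemma and dprel, so two nodes match only when both agree, and only the distance is computed, whereas our system also recovers the corresponding edit script.

```python
# Illustrative only: toy dependency trees for the running example, with
# 'lemma/dprel' labels so that two nodes match only when both lemma and
# dprel agree. The zss package implements Zhang and Shasha (1989).
from zss import Node, simple_distance

t_tree = (Node('spray/ROOT')
          .addkid(Node('girl/nsubj').addkid(Node('the/det')))
          .addkid(Node('plant/dobj').addkid(Node('the/det')))
          .addkid(Node('with/prep').addkid(Node('water/pobj'))))

h_tree = (Node('spray/ROOT')
          .addkid(Node('boy/nsubj').addkid(Node('the/det')))
          .addkid(Node('plant/dobj').addkid(Node('the/det')))
          .addkid(Node('with/prep').addkid(Node('water/pobj'))))

# A single replacement (girl/nsubj -> boy/nsubj) turns the tree of T into
# the tree of H, so the minimum edit distance is 1.
print(simple_distance(t_tree, h_tree))
```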
4.2 Transformation form
The transformations are used by the classifier as features and one could envision different ways of representing them. At least two elements need to be specified: the transformation type (e.g., inserting) and the node (or nodes in the case of replacing) involved in the transformation itself. For the nodes, we need to set the level of specificity/genericity; for example, the nodes could be expressed by the lemma of the words (e.g., girl), or by combining the lemma with the PoS (e.g., girl-NN) or combining the lemma and dprel (e.g., girl-nsubj). To avoid overfitting, we adopt a representation in which the nodes in the transformation are only represented by their dependency labels, i.e.,
• Replacing(T:node-dprel, H:node-dprel),
• Inserting(H:node-dprel),
• Deleting(T:node-dprel).
For our running example, this would be: Replacing(T:nsubj, H:nsubj), meaning that a word in T has to be replaced with another word in H and that both these words have a syntactic relation of nsubj with their respective parents.
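A minimal sketch of how a transformation could be rendered as such a feature name is given below; the function and the simplified node representation are illustrative and do not reproduce the actual implementation.

```python
# Illustrative sketch: rendering a transformation as a feature name that
# keeps only the dependency labels (dprel) of the nodes involved.
from collections import namedtuple

# Minimal node representation for illustration purposes only.
DepNode = namedtuple('DepNode', ['lemma', 'dprel'])

def feature_name(operation, t_node=None, h_node=None):
    if operation == 'replace':
        return f'Replacing(T:{t_node.dprel}, H:{h_node.dprel})'
    if operation == 'insert':
        return f'Inserting(H:{h_node.dprel})'
    if operation == 'delete':
        return f'Deleting(T:{t_node.dprel})'
    raise ValueError(f'unknown operation: {operation}')

girl = DepNode('girl', 'nsubj')
boy = DepNode('boy', 'nsubj')
print(feature_name('replace', girl, boy))  # Replacing(T:nsubj, H:nsubj)
```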
4.3 The role of the knowledge resources
The use of resources like WordNet is crucial for recognizing cases where T and H use different textual expressions that preserve the entailment (e.g., child versus boy). The Knowledge Resources that we use (see Section 5.2) contain both semantic relations that preserve the entailment relation (e.g., WordNet synonyms and hypernyms) and that change the entailment relation (e.g., WordNet antonyms). With tree edit distance, we use the preserving relations in the following way:
WHEN in the Knowledge Resources there exists an entailment-preserving relation between two words, THEN those words are matched as if they were the same word.
As an example, consider the two words toy and object in the pair T:Two dogs are running and carrying a toy in their mouths, H:Two dogs are running and carrying an object in their mouths. For these two words, the following matching transformation would be produced: Matching(T:dobj, H:dobj). On the other hand, the non-preserving relations are used for labeling those transformations that, when applied on two words, do not preserve the entailment (e.g., white versus black, leaving versus arriving):
WHEN in the Knowledge Resources there exists a non-entailment-preserving relation between two words in a transformation, THEN we label that transformation with the label non-preserving entailment.
As a result, the transformation for our running example, the replacement of girl with boy, which are antonyms in WordNet, can be represented as Replacing(T:nsubj, H:nsubj, non-preserving entailment).
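The two rules can be sketched as follows; the tiny relation sets stand in for lookups into WordNet, CatVar and VerbOcean and are purely illustrative.

```python
# Illustrative sketch of how lexical knowledge affects matching and labeling.
# The sets below are stand-ins for WordNet/CatVar/VerbOcean lookups.
PRESERVING = {('toy', 'object'), ('kid', 'boy'), ('home', 'house')}          # e.g. synonyms, hypernyms
NON_PRESERVING = {('girl', 'boy'), ('white', 'black'), ('leave', 'arrive')}  # e.g. antonyms

def lemmas_match(lemma_t, lemma_h):
    """Rule 1: entailment-preserving relations are treated as exact matches."""
    return lemma_t == lemma_h or (lemma_t, lemma_h) in PRESERVING

def label_transformation(lemma_t, lemma_h, feature):
    """Rule 2: non-preserving relations mark the transformation explicitly."""
    if (lemma_t, lemma_h) in NON_PRESERVING:
        return feature[:-1] + ', non-preserving entailment)'
    return feature

print(lemmas_match('toy', 'object'))                                      # True
print(label_transformation('girl', 'boy', 'Replacing(T:nsubj, H:nsubj)'))
# -> Replacing(T:nsubj, H:nsubj, non-preserving entailment)
```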
4.4 Transformations produced from SICK
As described in Section 3, the annotators have created T–H pairs with similar meaning, contrasting meaning and pairs where the truth value of one fragment cannot be inferred from the other fragment. Table 3 shows the transformations that our method produces for some examples of T–H pairs with similar meaning, while Table 4 shows the transformations for examples of the two remaining cases.
5 System architecture
We recast RTE as a classification problem in which the provided T–H pairs are classified as positive or negative examples. Figure 2 shows the system’s architecture.
The T–H pairs are first preprocessed (Section 5.1) to produce their dependency trees, then a set of features (i.e., the transformations) is extracted from these trees (Section 5.3). Knowledge Resources (Section 5.2) can be used in this phase to match words with the same or similar meaning and those with opposite meanings. Finally, a classifier uses the features produced (Section 5.4) to classify the T–H pairs to be annotated.
The implementation of the system was developed within the EOP (Padó et al. 2013; Magnini et al. 2014) for RTE, which is a generic architecture and a comprehensive implementation for textual inference in multiple languages that promotes the reuse of software components. The platform includes state-of-the-art algorithms for RTE relations, a large number of knowledge resources and facilities for experimenting and testing innovative approaches.
5.1 Preprocessing
The T–H pairs are preprocessed with PoS tagging, lemmatization and dependency parsing. The output of this phase consists of the dependency trees created for each pair (i.e., one tree for the Text (T) and another for the Hypothesis (H)). We use TreeTagger (Schmid 1994) for tokenization, lemmatization and PoS tagging, while MaltParser (Nivre et al. 2006) is used for dependency parsing. As an example, consider the pair: T:The girl is spraying the plants with water and H:The boy is spraying the plants with water for which the preprocessing output has been reported in Figure 1. This will be used as the running example in this section.
5.2 Knowledge resources
Knowledge resources are crucial in recognizing cases where T and H use different textual expressions that are semantically related and can be relevant in identifying the entailment relation (e.g., kid versus boy, home versus house). In this context, it must be noted that the component we are using for querying the Lexical Knowledge currently has some limitations. First of all, we cannot access the resources by a combination of both lemma and PoS, but by lemma only. As a consequence, it could happen, for example, that a word used as a noun in a certain context and another word used as a verb produce a positive match only because they share the same lemma. Second, resources like WordNet are used without any word sense disambiguation, i.e., a word in the text fragments could be matched with its wrong sense in the resource. In the rest of this section, we briefly report on the three resources used, namely WordNet, CatVar, and VerbOcean, and describe how their semantic relations (e.g., synonym) can be grouped into relations that preserve the entailment and relations that do not.
WordNet (Miller 1995) is a lexical database that groups words into sets of synonyms called synsets, provides short, general definitions and records the various semantic relations between these synonym sets. From the available relations, we use the following ones: hypernym (e.g., leave → move), entailment (e.g., leave → go), synonym (e.g., leave → go away) and antonym (e.g., leave → arrive). Of these, the hypernym, entailment and synonym are considered to be preserving entailment, and antonym is considered to be non-preserving.
CatVar (Habash and Dorr 2003) is composed of uninflected words (lemmas) and their categorial (i.e., PoS) variants. For example, the words hunger (V), hunger (N), hungry (AJ) and hungriness (N) are different English variants of the same underlying concept describing the state of being hungry. The local-entailment relation present in CatVar is considered to be an entailment-preserving relation.
VerbOcean (Chklovski and Pantel 2004) is similar to WordNet, but applies only to verbs. VerbOcean contains lexical–semantic relations that are unique to verbs, such as stronger than (e.g., whisper → talk). Many of these relations indicate entailment at the lexical level, and are useful for covering inference relations between verbs that do not exist in WordNet. The relations used from VerbOcean are similar, which preserves entailment, and opposite of, which is a non-preserving relation.
5.3 Feature extraction
Feature extraction is used in order to convert each input pair into a feature set. We represent each T–H pair as a set of binary features, which are the transformations obtained by applying tree edit distance on the dependency trees produced in the previous phases. To limit time and space complexity, we use the implementation described in the paper by Zhang and Shasha (1989), as mentioned in Section 4.1.
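For illustration, the following sketch turns the set of transformations extracted for each pair into a binary feature vector over the vocabulary of transformations seen in the training data; the feature names are the hypothetical ones used in the examples above, not actual system output.

```python
# Illustrative sketch: converting per-pair transformation sets into binary
# feature vectors over the vocabulary of transformations seen in training.

def build_vocabulary(training_transformations):
    """training_transformations: one set of feature names per T-H pair."""
    return sorted(set().union(*training_transformations))

def to_binary_vector(pair_transformations, vocabulary):
    return [1 if feature in pair_transformations else 0 for feature in vocabulary]

train = [{'Replacing(T:nsubj, H:nsubj)', 'Matching(T:dobj, H:dobj)'},
         {'Inserting(H:det)', 'Matching(T:dobj, H:dobj)'}]
vocabulary = build_vocabulary(train)
print(to_binary_vector(train[0], vocabulary))
```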
5.4 Classification
The availability of annotated data sets makes it possible to formulate the problem of RTE in terms of a classification task. For classification, we used Weka (Hall et al. 2009), which is a collection of machine learning algorithms for data mining tasks. Using Weka, we decided to try different algorithms, like Naive Bayes (Rish 2001), logistic regression (le Cessie and van Houwelingen 1992), J48 (i.e., an open source Java implementation of the C4.5 algorithm (Quinlan 1986) in the Weka data mining tool), Random Forests (Breiman 2001) and SVM (Vapnik 1995).
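Our implementation relies on Weka; purely as a sketch of the kind of comparison carried out in Section 6.2, the snippet below uses rough scikit-learn counterparts of these classifiers on randomly generated stand-in data (in the real system, X and y are the binary transformation features and gold labels described above).

```python
# Sketch only: scikit-learn stand-ins for the Weka classifiers used in the
# paper, run on randomly generated toy data instead of the real features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data: 200 pairs, 50 binary transformation features, 3 labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = rng.choice(['entailment', 'contradiction', 'neutral'], size=200)

classifiers = {
    'SVM (linear kernel)': SVC(kernel='linear'),
    'SVM (polynomial kernel, degree 2)': SVC(kernel='poly', degree=2),
    'Random Forests': RandomForestClassifier(n_estimators=100),
    'Decision tree (J48/C4.5 analogue)': DecisionTreeClassifier(),
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Naive Bayes (Bernoulli)': BernoulliNB(),
}

for name, classifier in classifiers.items():
    scores = cross_val_score(classifier, X, y, cv=10, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f}')
```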
6 Experiments and results
We evaluated our approach by testing it on both the SICK and EXCITEMENT data sets. As regards SICK, the experiments were conducted in the same way as the ones performed for SemEval-2014 Task#1. This means that we split the development set into two parts: a dev-train for training the system (3,290 examples) and a dev-test (1,644 examples) for finding the best system configuration to be used for the test set. Achieving the best system configuration required a number of experiments: Section 6.1 describes the baseline and two basic system configurations that were used as reference points against which the other tested configurations were compared. After choosing the best of these configurations, Section 6.2 reports on the experiments conducted using different algorithms for the classification of the T–H pairs, while Section 6.3 reports on the contribution of the knowledge resources to the final system performance. Section 6.4 compares the different approaches to node matching.
With respect to the EXCITEMENT data set, each of its four data sets comes in two parts: a dev-train that we use for training the system and a dev-test for tuning. Given that the test data set was not available at the time of writing, in this section, we report the results we have obtained on the dev-test. We would like to highlight that the same experiments conducted on SICK were also performed on this data set to find the most promising system configurations. For reasons of space, and considering that the fine-grained annotation of SICK allows for much more refined conclusions, and that the EXCITEMENT test set will be released at a later time, for EXCITEMENT, we show only the results for a number of configurations; this is done in Section 6.5.
Finally, Section 6.6 concludes by comparing our results on SICK with those obtained by the participants of SemEval-2014 Task#1, while for EXCITEMENT, the results are compared with those of the systems distributed with the EOP. For both data sets, we ran the experiments on the same hardware infrastructure.
6.1 Baseline and evaluation measures
To measure the performance of each of the system configurations that were tested, we created a baseline calculated by annotating all the T–H pairs in the dev-test with the most common entailment relation in the dev-train (i.e., neutral). Then, we explored two different basic configurations: basic-system#1 was obtained by using SVM with a linear kernel for classification. SVM is one of the most popular classification algorithms used in machine learning, while the linear kernel is faster than other kernel functions and easier to optimize. WordNet, CatVar and VerbOcean were used as they are well-known, freely available resources. Next, basic-system#2 was obtained in the same way as basic-system#1, but without using any external resources.
In Table 5, the overall system performance over the different category labels is measured in terms of accuracy (i.e., the percentage of predictions that are correct), while, in the rest of the section, the classification effectiveness over the single category is calculated in terms of the classic information retrieval notions of Precision (Pr) and Recall (Re), together with the F1 measure.
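For reference, with true positives (TP), false positives (FP) and false negatives (FN) counted per category, these measures are defined in the standard way:

$$\mathrm{Pr}=\frac{TP}{TP+FP},\qquad \mathrm{Re}=\frac{TP}{TP+FN},\qquad F_{1}=\frac{2\cdot \mathrm{Pr}\cdot \mathrm{Re}}{\mathrm{Pr}+\mathrm{Re}}$$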
Supervised learning methods need annotated data in order to generate efficient models. However, annotated data is relatively scarce and can be expensive to obtain. The learning curve plotted for our method (Figure 3) shows the accuracy on the dev-test as a function of the number of examples in the dev-train. Acceptable results can be obtained using only twenty per cent of the training data (980 examples). Moreover, exceeding fifty per cent of the training examples (2,467 examples) does not produce any significant improvements; the additional annotated examples cannot resolve the most problematic pairs, which will be examined later in this section.
6.2 Algorithms evaluation
Starting from basic-system#1 (SVM), we replaced SVM with other algorithms for classification: Random Forests, J48, logistic regression and Naive Bayes; we chose these algorithms because they are frequently reported in the literature as having good classification performance. A second-degree polynomial kernel for SVM was also used in order to try to capture the complex interactions that occur among the syntactic transformations. Table 6 reports on the results obtained when these algorithms are trained on the SICK dev-train and evaluated on the dev-test. The annotation (≫), (≪) or (~) in the t-test field of the table indicates that a specific result is statistically better (≫), worse (≪) or equivalent (~) compared to the results of basic-system#1 (SVM) at the significance level specified (currently 0.05).
According to this test, SVM with polynomial kernel and Random Forests performed better than SVM using linear kernel, while Naive Bayes performed poorly. These results could also be explained by considering the presence of interactions among the features extracted: if each of the transformations makes an independent contribution to the output, then algorithms based on linear functions (e.g., logistic, SVM with linear kernel, Naive Bayes) generally perform well. However, if there are complex interactions among features, as is the case here, then algorithms such as SVM with polynomial kernels of higher degree, decision trees or ensemble methods using decision trees, like Random Forests, work better, because they are specifically designed to discover these interactions.
6.3 Knowledge resources evaluation
To measure the impact of the knowledge resources on recognizing the entailment relations, we started with basic-system#1 (SVM), using all the resources described in Section 5.2 and then removed one lexical resource at a time. This is reported in Table 7. In the table, the coverage of the resources is measured as the number of times we find a relation in a given resource, divided by the number of accesses to the resource.
WordNet turned out to be the most important resource (− 2.9 when it is not used), CatVar (− 0.2) gives a small contribution, and VerbOcean (+0.3) does not seem to be useful. The good result of using WordNet can also be explained by considering its coverage, which is higher than that of the other resources. Using WordNet enables, for example, the identification of the T–H pairs that had been created by replacing words with synonyms or with semantic opposites. Concerning VerbOcean (but this is also true for all the other resources), we must point out once again that the EOP component that we are currently using for querying the Lexical Knowledge does not allow access to the resources by a combination of both lemma and PoS, but by lemma only. As a direct consequence of this, it could happen, for example, that, even though VerbOcean contains semantic relations only between verbs, words used as a noun in a certain context, and those that are actually verbs, might produce a positive match, which is used in the tree edit distance calculation. As an example of this problem, consider the pair T:A little girl is hitting a baseball off a tee, H:A little girl in a pink shirt is playing t-ball and taking a swing, where the verb hit in the Text (T) and the noun swing in the Hypothesis (H) create a positive match. The same problem can occur when accessing WordNet; for the pair T:A monkey is pulling a dog’s tail, H:A monkey is teasing a dog at the zoo, the noun tail in T is matched with the noun dog in H, because they are synonyms in WordNet but with the meaning of ‘to follow’ (i.e., a verb).
6.4 Node matching evaluation
Tree edit distance computes the transformations working on the nodes in the trees, and different transformations can be produced under different types of node matching. In basic-system#1 (SVM), two nodes are considered equivalent if and only if they have the exact same lemma and dprel. However, other strategies for comparing nodes are possible. Using only the lemma results in a significant drop in accuracy (− 20.1); in this case, for example, a lemma that is the subject of a Text (T) can be successfully matched by tree edit distance against the same lemma playing another syntactic role in a Hypothesis (H) (e.g., the object). On the other hand, matching the nodes by only considering the syntactic relations lets tree edit distance match nodes that have the same syntactic role in the text fragments (e.g., the subject) but which, in fact, represent different words (e.g., man versus car, woman versus potato); this may also explain the loss in accuracy (− 12.3).
6.5 System evaluation
In the previous sections, we have reported a large number of experiments in order to find the best system configuration on the dev-test. When using that configuration to train our system on the whole development set and then testing it on the test data set, we obtained 82.4 per cent accuracy. Table 8 gives details about the system’s performance for all three entailment relations of SICK, i.e., contradiction, entailment and neutral. Table 9 reports the confusion matrix obtained on the SICK data set.
The confusion matrix reveals that in some cases the entailment and contradiction relations can be confused with the neutral one. An explanation for this lies in the higher presence in the data set of sentences with subjects that contain indefinite articles, where the entailment judgments are subject to human interpretation. In fact, in the data set, there are pairs annotated with neutral or contradiction labels depending on whether the subjects of their T and H were believed to refer to the same entity or two distinct ones.
The results, grouped into categories corresponding to the different expansion rules that had been applied, can be found in Table 10. To obtain S2, the annotators applied meaning-preserving rules, while to obtain S3, they used rules creating contradictory or contrasting meaning. Finally, to get sentences S4, a set of word scrambling rules was applied while ensuring that the resulting sentence was still meaningful. The results in the table are only reported for those rules that generate at least ten T–H pairs in the test data set, while sentences S1, S2, S3 and S4 can appear as either T or H. The annotation of the pairs was done using the primary configuration.
Errors in the system’s output concern the pairs that have been created by replacing words with synonyms and with semantic opposites. Here, a major issue can be the lack of coverage of the knowledge base used for word matching; as an example of words for which we do not have any match in the knowledge base, consider grass and lawn in the pair T:A brown and white dog is playing on the grass, H:A brown and white dog is playing on the lawn. Other examples involve words used with opposite meanings, like playing and sleeping in T:Children in red shirts are playing in the leaves, H:Children in red shirts are sleeping in the leaves; or the use of partial synonyms like marsh and river in T:A monkey is wading through a marsh, H:A monkey is wading through a river. One instance where one could expect better performance is that of the pairs created by turning compounds into relative clauses. Examples of such pairs are T:A dog is pushing a toddler into a rain puddle, H:A dog is pushing a toddler into a puddle of rain, or pairs like T:A person is looking at a bike designed for motocross that is lying on its side and another is racing by, H:A person is looking at a motocross bike that is lying on its side and another is racing by. While for the first pair our approach creates simple transformations (i.e., replacing the noun compound rain in T with the nominal modifier rain in H, and then inserting a new node for the case relation of), the second pair produces a much larger number of transformations, which our system finds hard to process. Another factor contributing to this result is the limited number of annotated examples for this rule.
Regarding EXCITEMENT, we have conducted several experiments on its four splits. Table 11 reports the results we obtained. In the table, basic-system#1 (SVM) and basic-system#2 (SVM w/o resources) are obtained in the same way as for the SICK experiments. The baseline was calculated by annotating all the T–H pairs with the most common entailment relation in the data sets (i.e., non-entailment).
For this set of experiments, an important point to highlight is that the lexical resources we used had almost no effect, as indicated by comparing basic-system#1 (SVM), which uses the lexical resources, and basic-system#2 (SVM w/o resources). The reason for this is that their coverage is close to zero and more specific domain-based resources would be necessary. Another issue that comes to light when working on the four splits of the data set is that the performance levels of the classifiers are relatively close to each other, meaning that the transformations can be successfully exploited by most of the classifiers. When we consider the mix-balanced and mix-unbalanced data sets, the SVM and Random Forests algorithms perform as well as they did for SICK. SVM, which is known to be robust to overfitting, also obtained good performance on the pure-balanced data set. Finally, working with the pure-unbalanced data set is a demanding task for all the classifiers because the syntactic transformations extracted from the training data set might not accurately represent the data in the test set. Here, a simple classifier like Naive Bayes, not so finely tuned to contingent characteristics of the data set, performs a bit better than the others. It should be noted that obtaining high accuracy values on this split is easy, since a trivial rejector, i.e., a classifier that assigns all pairs to the most heavily populated category (i.e., non-entailment), as the baseline does, can indeed achieve very high accuracy, even though there are no applications in which one would be interested in such a classifier. In this case, F1 values can offer additional insight. However, accuracy values are not the only factor to consider when selecting a classifier. For example, for real-time applications, one could prefer higher annotation speed and might be willing to compromise on accuracy in order to achieve that; in this case, SVM with linear kernel and Random Forests would be preferable to SVM with second-degree polynomial kernel.
6.6 A comparison with other approaches
This section provides a comparison of the results obtained by AdArte with the ones obtained by different methods. For SICK, this is done by comparing AdArte with the Illinois-LH system (Lai and Hockenmaier 2014) and the other systems at SemEval-2014 Task#1 (see Table 12). In addition, we compared AdArte with systems that, when working on data sets such as RTE-3, obtained state-of-the-art results: TIE is a system which combines the outcomes of several scoring functions into a classifier, while P1EDA uses various independent aligners to compute evidence for or against entailment. Given that the current implementation of these systems does not support multi-class classification (which is required to annotate SICK), we evaluated them on the two-class entailment problem. In this case, all the examples in SICK annotated with contradiction and neutral labels were converted to non-entailment.
AdArte’s results are close to those of UNAL-NLP, which was ranked first among the systems that did not use compositional models. Purely compositional models have lower performance; haLF obtained 69.4 per cent accuracy, while the system presented by Illinois, which used a combination of the two models, was ranked first (84.6 per cent). The results of TIE (Wang and Neumann 2007) and P1EDA (Noh et al. 2015) can be explained by considering that these approaches might not work well with the phenomena present in the data set, such as negations, given that their alignment-based methods are not able to capture them. The approach by Bowman et al. (2015) shows that the representations learned by a neural network model on a large high-quality corpus such as the Stanford Natural Language Inference corpus can be used to improve performance on a standard challenge data set: 80.8 per cent accuracy versus 71.3 per cent when training on SICK alone.
Concerning the EXCITEMENT data sets, our results are compared with those of the other systems implemented within the EOP (see Table 13), i.e., the systems made available in the EOP GitHub repository, and with SemantiKLUE (Proisl et al. 2014), which was one of the top five systems at SemEval-2014 Task#1, and which has also been tested on this data set.
From the table, it is apparent that in the case of the two data sets built by evenly distributing the T–H pairs of different topics in the dev-train and dev-test (i.e., the mix data sets) our method is better than others at learning patterns and regularities from data. In contrast, on the data sets where the examples in the training are not representative of the examples in the test (i.e., the pure data sets), the difference between our method and the others becomes smaller. The difference disappears in the case of the pure-balanced data set, whereas on the pure-unbalanced data, systems can have good accuracy on the majority class (i.e., non-entailment) but very poor accuracy on the minority class (i.e., entailment) due to the influence that the majority class has on the training criteria. In fact, many algorithms for classification are accuracy-driven. In this sense, a system like EditDistance, whose simple classifier can be configured to maximize the F1 measure during training, produces F1 values on the pure-unbalanced data set that are above average compared to the other systems. Good results have been obtained on all the splits by one of the top five systems at SemEval-2014 Task#1, SemantiKLUE (Proisl et al. 2014), which, in addition, proved to be portable to new domains. Finally, it is worth highlighting that the lack of capitalization and punctuation in the data set, as well as the presence of imperatives and elliptical phrases, does not seem to have affected our results with respect to other approaches based on shallow linguistic analysis, for example, the one used to implement edit distance models.
7 Conclusions and outlook
In this paper, we have described an approach to RTE based on tree-level transformations. We have seen how the T–H pairs can be represented by their dependency trees and how the transformations can be calculated from these trees by tree edit distance. The transformations have then been used as features in a classifier to label the pairs as positive or negative. Lastly, the implemented AdArte system has been made publicly available through the EOP, a generic architecture and a comprehensive implementation for textual inference in multiple languages. Unlike other existing systems, AdArte is designed to work with larger data sets. The system produces thousands of transformations and, consequently, requires that a certain number of annotated examples be available to learn a model. In this context, we have evaluated AdArte on two new data sets that present this feature: SICK, used at SemEval-2014 Task#1, and EXCITEMENT, a new data set of email feedback.
With respect to SICK, a comparison of our results with those of SemEval-2014 systems reveals promising results: the transformations are able to capture the main syntactic and lexical phenomena in the data set and they can be successfully used with a variety of classifiers. This result is all the more interesting if we consider that some systems at SemEval-2014 use approaches that would be difficult to adapt to new domains such as the EXCITEMENT one (e.g., the denotational features used in the Illinois system are fairly strongly tied to the SICK domain) or more complex machine learning techniques, such as the co-training used in ECNU to exploit large amounts of unlabeled data, which are outside the scope of this work.
With regard to the EXCITEMENT data set, we have compared our method with the systems in the EOP. In the case of the two data sets built by evenly distributing the T–H pairs referring to different topics in the dev-train and dev-test (i.e., the mix data sets), our method appears better than others at learning patterns and regularities from data. In contrast, on the data sets where the examples in the training are not so representative of those in the test set (i.e., the pure data sets), the difference between our method and the best systems in the EOP becomes smaller or disappears, revealing a certain degree of overfitting. This phenomenon is even more visible when working with pure-unbalanced data sets, where edit distance models, which learn simple threshold values, outperform other systems.
Despite the satisfactory results, our method has two main types of limitations: limitations regarding the size of the data set and limitations regarding the use of the knowledge resources. The first type of limitation is due to the approach itself, which is suitable for data sets that consist of a sufficient number of examples to learn a model. A possible example of a data set for which our approach is not indicated is RTE-3, which consists of 800 annotated pairs. When tested on this data set, AdArte produces lower accuracy values (58.3 per cent) than other methods tested on the same data set. In fact, in the Third PASCAL RTE Challenge (RTE-3), participants reported accuracy values ranging from forty-nine to eighty per cent, while, for example, a system such as P1EDA obtained 67.0 per cent.
Another limitation of our approach stems from cases that cannot be solved by comparing the syntactic structures of the T–H pairs. This is due to the fact that syntax alone is not sufficient to determine entailment in more complex cases that require some degree of text understanding. As an example, consider the following T–H pair, where the existing entailment relation between T:Carl Smith collided with a concrete lamp-post while skating and suffered a skull fracture that caused a coma. When he failed to regain consciousness, his parents on August 8 consented to his life support machine being turned off, and H:Carl Smith died on August 8 cannot be determined by merely considering the syntactic differences between them. Another limitation of this type is the inability to deal with phrasal verbs in the current implementation of the system. One example of such phrasal verbs would be: I am counting on you to make dinner while I am out. In our approach, tree edit distance can only work at the level of single nodes and, as a consequence, it cannot match phrases such as count on with trust or go on with continue. Missing these matches prevents us from using resources like WordNet, which would be able to identify that go on and continue are synonyms and they should match. We are going to address this issue by providing an additional preprocessing phase able to take as input the dependency trees and then collapse the nodes of a phrasal verb into a single node. The last issue refers to the component that we are using for querying the lexical knowledge: it cannot access the resources by a combination of both lemma and PoS, but by lemma only, and, moreover, resources like WordNet are used without any word sense disambiguation. We will also address this issue by using a new software component that has been made available in order to overcome these limitations.
We plan to evaluate AdArte in languages other than English, such as German and Italian; for this purpose, we will use the OMQ and EXCITEMENT Italian data sets, created from manually categorized German and Italian customer requests, which were recently added to the EOP for evaluation.
Finally, considering the promising results we obtained on the two data sets tested, we are going to use AdArte in a real-world application developed within the EXCITEMENT project. More precisely, this application aims to provide an easy-to-use interface for analysts to gain insight from a mass of customer interactions through TE. The application takes as input the customer interactions from various service channels of a typical contact center (e.g., transcribed calls and emails). Then, automatic text fragment extraction is applied to the customer interactions to extract meaningful text fragments such as reasons for dissatisfaction or other customer events that could be interesting for analysis. AdArte will take the set of relevant fragments as input to produce a structured hierarchy based on the entailment relations that hold between these text fragments, which will then be displayed by the application.