
Evaluating explainable AI attribution methods in neural machine translation via attention-guided knowledge distillation

Published online by Cambridge University Press: 28 May 2026

Aria Nourbakhsh*
Affiliation:
University of Luxembourg, Luxembourg
Salima Lamsiyah
Affiliation:
University of Luxembourg, Luxembourg
Adelaide Danilov
Affiliation:
University of Luxembourg, Luxembourg
Christoph Schommer
Affiliation:
University of Luxembourg, Luxembourg
*Corresponding author: Aria Nourbakhsh; Email: aria.nourbakhsh@uni.lu

Abstract

The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models, building upon the forward simulation of XAI methods. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student’s ability to simulate targets. Using the Inseq library, we extract attribution scores over source–target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de–en, fr–en, ar–en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture the alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source–target pair, learns to reconstruct the teacher’s attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.
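The four composition operators named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `compose`, the operator labels, and the choice to combine the two matrices element-wise without any renormalization are assumptions made for illustration, and the abstract does not say whether injection happens before or after the softmax.

```python
import numpy as np


def compose(attn, attr, op):
    """Combine an attention matrix with an attribution map of the same shape.

    Hypothetical helper: both inputs are assumed to be non-negative arrays
    of identical shape. No renormalization is applied afterwards.
    """
    if attn.shape != attr.shape:
        raise ValueError("attention and attribution must share a shape")
    if op == "add":
        return attn + attr
    if op == "multiply":
        return attn * attr
    if op == "average":
        return 0.5 * (attn + attr)
    if op == "replace":
        # The attribution map fully overrides the learned attention.
        return attr.copy()
    raise ValueError(f"unknown operator: {op}")


# Toy example: two positions attending over two positions.
attn = np.array([[0.7, 0.3], [0.4, 0.6]])
attr = np.array([[0.2, 0.8], [0.5, 0.5]])
for name in ("add", "multiply", "average", "replace"):
    print(name, compose(attn, attr, name))
```

In this sketch, averaging preserves row sums when both inputs are row-normalized, whereas addition and multiplication change the scale, so a real model would likely need some renormalization step around them.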

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press
Figure 1. An example of attribution maps derived from different XAI methods for the source sentence ‘Dann gibt es noch Anbieter, die kaum Fahrraderfahrung, jedoch gute Fernostkontakte haben und so an günstige E-Bikes kommen.’ and the target ‘Then there are suppliers with little or no experience in the bicycle industry but good contacts in the Far East, thus giving them access to low-cost e-bikes.’ In the heatmaps, the rows correspond to source tokens and the columns to target tokens. The heatmaps are generated from columns normalized with the MinMax normalizer.
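The column-wise MinMax normalization mentioned in the Figure 1 caption can be sketched as follows. The function name `minmax_columns` and the handling of constant columns (mapped to zeros) are assumptions for illustration; the matrix layout follows the caption, with rows as source tokens and columns as target tokens.

```python
import numpy as np


def minmax_columns(attr, eps=1e-12):
    """Rescale each column of a (source_len, target_len) matrix to [0, 1]."""
    col_min = attr.min(axis=0, keepdims=True)
    col_max = attr.max(axis=0, keepdims=True)
    span = col_max - col_min
    # Assumption: a constant column carries no ranking information,
    # so it is mapped to all zeros instead of dividing by zero.
    safe_span = np.where(span > eps, span, 1.0)
    return (attr - col_min) / safe_span


scores = np.array([[1.0, 5.0],
                   [3.0, 5.0],
                   [2.0, 5.0]])
print(minmax_columns(scores))
```

Normalizing per column means each target token's attributions are scaled independently, so heatmap intensities are comparable within a column but not across columns.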

Figure 2. (a) Illustrates the overall design of our approach. The input sequence and the gold output $(\mathbf{x}, \mathbf{y})$ are given to a teacher model, and their attributions $E$ are obtained. Then, a new untrained model is trained using the same $(\mathbf{x}, \mathbf{y}, E)$ triples. In the testing phase, the model gets the $(\mathbf{x}, E)\rightarrow \hat{\mathbf{y}}$. (b) Shows two places where we inject the attributions obtained from XAI methods.

Table 1. BLEU and chrF scores for de-en Marian-MT attributions. Scores followed by $\Delta$ over the baseline.

Table 2. BLEU and chrF scores for fr-en Marian-MT attributions. Scores followed by $\Delta$ over the baseline.

Table 3. BLEU and chrF scores for ar-en Marian-MT attributions. Scores followed by $\Delta$ over the baseline.

Table 4. BLEU and chrF scores for de-en mBART attributions. Scores followed by $\Delta$ over the baseline.

Table 5. BLEU and chrF scores for fr-en mBART attributions. Scores followed by $\Delta$ over the baseline.

Table 6. BLEU and chrF scores for ar-en mBART attributions. Scores followed by $\Delta$ over the baseline.

Table 7. BLEU and chrF scores for de-en Cross-attention Marian-MT attributions (Baseline: 25.82). Scores followed by $\Delta$ over the baseline.

Table 8. BLEU and chrF scores for fr-en Cross-attention Marian-MT attributions (Baseline: 27.01). Scores followed by $\Delta$ over the baseline.

Table 9. BLEU and chrF scores for ar-en Cross-attention Marian-MT attributions (Baseline: 40.68). Scores followed by $\Delta$ over the baseline.

Table 10. BLEU and chrF scores for de-en generated by the Marian-MT model. Scores followed by $\Delta$ over the baseline.

Table 11. BLEU and chrF scores for fr-en generated by the Marian-MT model. Scores followed by $\Delta$ over the baseline.

Table 12. BLEU and chrF scores for ar-en generated by the Marian-MT model. Scores followed by $\Delta$ over the baseline.

Table 13. BLEU scores for random attribution matrices (left) and diagonal attribution matrices (right). Scores followed by $\Delta$ over the baseline.

Table 14. BLEU and chrF scores for de-en Marian-MT 4 head attribution injection. Scores followed by $\Delta$ over the baseline.

Table 15. BLEU and chrF scores for fr-en Marian-MT 4 head attribution injection. Scores followed by $\Delta$ over the baseline.

Table 16. BLEU and chrF scores for ar-en Marian-MT 4 head attribution injection. Scores followed by $\Delta$ over the baseline.

Figure 3. Column-wise entropy based on Marian-MT attributions.

Figure 4. Column-wise entropy based on mBART attributions.

Figure 5. Column-wise entropy based on Marian-MT attributions (Marian-MT-generated targets).
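As a rough illustration of what a column-wise entropy over an attribution map could look like, the sketch below treats each target-token column as a distribution over source tokens. The function name `column_entropy`, the use of absolute values with L1 normalization, and the natural logarithm are all assumptions; the captions do not specify how the columns are turned into distributions.

```python
import numpy as np


def column_entropy(attr, eps=1e-12):
    """Shannon entropy (in nats) of each column of a (source, target) matrix."""
    mass = np.abs(attr)
    # Assumption: L1-normalize each column into a probability distribution.
    probs = mass / np.clip(mass.sum(axis=0, keepdims=True), eps, None)
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0.
    logs = np.log(np.where(probs > 0, probs, 1.0))
    return -(probs * logs).sum(axis=0)


attr = np.array([[0.25, 1.0],
                 [0.25, 0.0],
                 [0.25, 0.0],
                 [0.25, 0.0]])
print(column_entropy(attr))
```

Under these assumptions, a low entropy indicates attribution concentrated on a few source tokens, while a value near the logarithm of the source length indicates a nearly uniform spread.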

Figure 6. KL-divergence, Overlap@3, and $\tau @3$ for the prediction of attribution on Marian-MT attribution and gold (human) data.

Figure 7. KL-divergence, Overlap@3, and $\tau @3$ for the prediction of attribution on mBART attribution and gold (human) data.

Figure 8. KL-divergence, Overlap@3, and $\tau @3$ for the prediction of attribution on Marian-MT attribution generated target.
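Two of the metrics named in Figures 6 to 8 can be sketched for a single attribution column as follows. The function names, the direction of the KL-divergence (teacher relative to prediction), the smoothing constant, and the definition of Overlap@k as the shared fraction of top-k indices are assumptions; $\tau @3$ is left out because its exact definition is not given in the captions.

```python
import numpy as np


def kl_divergence(teacher, predicted, eps=1e-12):
    """KL(teacher || predicted) after clipping and renormalizing both vectors."""
    p = np.clip(np.asarray(teacher, dtype=float), eps, None)
    q = np.clip(np.asarray(predicted, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def overlap_at_k(teacher, predicted, k=3):
    """Fraction of the top-k source indices shared by both vectors."""
    top_t = set(np.argsort(-np.asarray(teacher))[:k].tolist())
    top_p = set(np.argsort(-np.asarray(predicted))[:k].tolist())
    return len(top_t & top_p) / k


teacher = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
predicted = np.array([0.05, 0.05, 0.1, 0.3, 0.5])
print(kl_divergence(teacher, predicted), overlap_at_k(teacher, predicted))
```

In this toy case the two vectors share only one index among their top three, so the overlap is one third, while identical vectors would give a KL-divergence of zero and an overlap of one.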

Figure 9. Regression plots of the Marian-MT attributions with gold (human) target sentence.

Figure 10. Regression plots of the mBART attributions with gold (human) target sentence.

Figure 11. Regression plot of the Marian-MT attributions with generated target sentences.

Table 17. Sample translations and metrics for Marian-MT and mBART under different attribution/XAI variants for a de-en sample.