Fixing confirmation bias in feature attribution methods via semantic match

Giovanni Cinà, Daniel Fernandez-Llaneza, Nishant Mishra, Tabea E. Röber, Sandro Pezzelle, Iacer Calixto, Rob Goedhart, Ş. İlker Birbil

Research output: Working paper › Preprint

Abstract

Feature attribution methods have become a staple for disentangling the complex behavior of black-box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses on the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cinà et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable model behaviors (e.g., focusing on an object relevant for prediction) and undesirable ones (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis of the metrics used to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.
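
To make the idea of semantic match concrete, here is a minimal, hypothetical sketch (not the authors' actual procedure, whose formal definition is given in Cinà et al. [2023]): a human hypothesis about which features should matter is encoded as a binary mask, and its agreement with the magnitude of the feature attributions is scored. The helper name `semantic_match_score` and the AUC-based scoring are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def semantic_match_score(attributions, concept_mask):
    """Score how well feature attributions align with a human hypothesis.

    attributions : per-feature attribution values for one prediction
                   (e.g., from SHAP or Integrated Gradients)
    concept_mask : binary array, 1 where the hypothesis says the model
                   should focus (e.g., the features of the relevant object)

    Returns the AUC of |attributions| as a detector of the hypothesized
    region: 1.0 means perfect match, 0.5 means no better than chance.
    """
    return roc_auc_score(concept_mask, np.abs(attributions))


# Toy example: a 4-feature tabular model, with the hypothesis that
# features 0 and 2 drive the prediction.
attributions = np.array([0.8, 0.05, 0.6, -0.1])
concept_mask = np.array([1, 0, 1, 0])
print(f"semantic match (AUC): {semantic_match_score(attributions, concept_mask):.2f}")
```

A low score under such a metric would flag a mismatch between the user's hypothesis and the explanation, which is exactly the kind of structured check the abstract argues is needed to counter confirmation bias.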
Original language: Undefined/Unknown
Publication status: Published - 3 Jul 2023

Keywords

  • cs.AI
  • cs.LG
