Interpretable Machine Learning

Complete Paper List in Reverse Chronological Order

Selected Recent Papers on Interpretable Machine Learning

A. R. Hsu, Y. Cherapanamjeri, A. Y. Odisho, P. R. Carroll, B. Yu (2024). Mechanistic Interpretation through Contextual Decomposition in Transformers. https://arxiv.org/pdf/2407.00886
Y. Chen, C. Singh, X. Liu, S. Zuo, B. Yu, H. He, J. Gao (2024). Towards consistent natural-language explanations via explanation-consistent finetuning. https://arxiv.org/abs/2401.13986
Q. Zhang, C. Singh, L. Liu, X. Liu, B. Yu, J. Gao, T. Zhao (2023). Tell your model where to attend: post-hoc attention steering for LLMs. ICLR 2024. https://arxiv.org/abs/2311.02262
A. Agarwal, A. M. Kenny, Y. S. Tan, T. M. Tang, B. Yu (2023). MDI+: a flexible random forest-based feature importance framework. https://arxiv.org/abs/2307.01932 (PCS related)
A. R. Hsu, Y. Cherapanamjeri, B. Park, T. Naumann, A. Odisho, and B. Yu (2023). Diagnosing transformers: illuminating feature space for clinical decision-making. ICLR 2024. https://arxiv.org/abs/2305.17588
C. Singh, A. R. Hsu, R. Antonello, S. Jain, A. G. Huth, B. Yu and J. Gao (2023). Explaining black box text modules in natural language with language models.
C. Singh, W. Ha and B. Yu (2021). Interpreting and Improving Deep-Learning Models with Reality Checks. https://arxiv.org/abs/2108.06847 to appear in the book entitled “xxAI - Beyond Explainable AI” (eds. Andreas Holzinger, Randy Goebel, Ruth Fong, Taesup Moon, Klaus-Robert Müller, and Wojciech Samek).
W. Ha, C. Singh, F. Lanusse, S. Upadhyayula, and B. Yu (2021). Adaptive Wavelet Distillation from Neural Networks through Interpretation. Proc. NeurIPS 2021. (code)
L. Rieger, W. J. Murdoch, C. Singh, B. Yu (2020). Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge. ICML Proceedings. (code)
C. Singh, W. Ha, F. Lanusse, V. Boehm, J. Liu, B. Yu (2020). Transformation Importance with Applications to Cosmology. ICLR Workshop paper. (code)
W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu* (2019). Definitions, methods, and applications in interpretable machine learning. PNAS, 116 (44) 22071-22080.
W. J. Murdoch, C. Singh, and B. Yu (2019). Hierarchical interpretations for neural network predictions. ICLR. (code)
W. J. Murdoch, P. Liu, and B. Yu (2018). Beyond word importance: contextual decomposition to extract interactions from LSTMs. Proc. ICLR 2018. https://arxiv.org/abs/1705.07356 (code)