Towards Explainable AI: Significance Tests for Neural Networks
Kay Giesecke
Advanced Financial Technologies Laboratory, Stanford University
people.stanford.edu/giesecke/ | fintech.stanford.edu
Joint work with Enguerrand Horel (Stanford)
1 / 27
Introduction
Neural networks underpin many of the best-performing AI systems, including speech recognizers on smartphones and Google's latest automatic translator.
The tremendous success of these applications has spurred interest in applying neural networks in a variety of other fields, including finance, economics, operations, marketing, medicine, and many others.
In finance, researchers have developed several promising applications in risk management, asset pricing, and investment management.
2 / 27
Literature
First wave: single-layer nets
- Financial time series: White (1989), Kuan & White (1994)
- Nonlinearity testing: Lee, White & Granger (1993)
- Economic forecasting: Swanson & White (1997)
- Stock market prediction: Brown, Goetzmann & Kumar (1998)
- Pricing kernel modeling: Bansal & Viswanathan (1993)
- Option pricing: Hutchinson, Lo & Poggio (1994)
- Credit scoring: Desai, Crook & Overstreet (1996)
Second wave: multi-layer nets (deep learning)
- Mortgages: Sirignano, Sadhwani & Giesecke (2016)
- Order books: Sirignano (2016), Cont & Sirignano (2018)
- Portfolio selection: Heaton, Polson & Witte (2016)
- Returns: Chen, Pelger & Zhu (2018), Gu, Kelly & Xiu (2018)
- Hedging: Halperin (2018), Bühler, Gonon & Teichmann (2018)
- Optimal stopping: Becker, Cheridito & Jentzen (2018)
- Treasury markets: Filipovic, Giesecke, Pelger & Ye (2019)
- Real estate: Giesecke, Ohlrogge, Ramos & Wei (2019)
- Insurance: Wüthrich & Merz (2019)
3 / 27
Explainability
The success of NNs is largely due to their amazing approximation properties, superior predictive performance, and their scalability.
A major caveat, however, is model explainability: NNs are perceived as black boxes that permit little insight into how predictions are being made.
Key inference questions are difficult to answer:
- Which input variables are statistically significant?
- If significant, how can a variable's impact be measured?
- What is the relative importance of the different variables?
4 / 27
Explainability matters in practice
This issue is not just academic; it has slowed the implementation of NNs in financial practice, where regulators and other stakeholders often insist on model explainability.
- Credit and insurance underwriting (regulated): transparency of underwriting decisions
- Investment management (unregulated): transparency of portfolio designs; economic rationale of trading decisions
5 / 27
This paper
We develop a pivotal test to assess the statistical significance of the input variables of a NN.
- Focus on single-layer feedforward networks
- Focus on the regression setting
We propose a gradient-based test statistic and study its asymptotics using nonparametric techniques; the asymptotic distribution is a mixture of χ² laws.
The test enables one to address key inference issues:
- Assess the statistical significance of variables
- Measure the impact of variables
- Rank-order variables according to their influence
Simulation and empirical experiments illustrate the test.
6 / 27
Problem formulation
Regression model: Y = f_0(X) + ε, where
- X ∈ 𝒳 ⊂ R^d is a vector of d feature variables with law P
- f_0 : 𝒳 → R is an unknown deterministic C^1 function
- ε is an error variable: ε ⊥ X, E(ε) = 0, E(ε²) = σ² < ∞
To assess the significance of a variable X_j, we consider sensitivity-based hypotheses:
H_0: λ_j := ∫_𝒳 (∂f_0(x)/∂x_j)² dµ(x) = 0
H_A: λ_j ≠ 0
Here, µ is a positive weight measure. A typical choice is µ = P, and then λ_j = E[(∂f_0(X)/∂X_j)²].
7 / 27
Intuition
Suppose the function f_0 is linear (multiple linear regression): f_0(x) = Σ_{k=1}^d β_k x_k.
Then λ_j ∝ β_j², the squared regression coefficient for X_j, and the null takes the form H_0: β_j = 0 (→ t-test).
In the general nonlinear case, the derivative ∂f_0(x)/∂x_j depends on x, and λ_j = ∫_𝒳 (∂f_0(x)/∂x_j)² dµ(x) is a weighted average.
8 / 27
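To make the sensitivity measure concrete (this example is not from the slides), the following Python sketch approximates λ_j by Monte Carlo for a hand-coded function with a known gradient; the choice f_0(x) = 0.5 x_1² + 2 x_2 and the sample size are illustrative assumptions.

    import numpy as np

    # Hypothetical ground truth f_0(x) = 0.5*x_1^2 + 2*x_2 on [-1, 1]^2,
    # so df_0/dx_1 = x_1 and df_0/dx_2 = 2 (illustration only).
    def grad_f0(x):
        return np.column_stack([x[:, 0], np.full(len(x), 2.0)])

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(100_000, 2))   # mu = P = Uniform(-1, 1)^2

    lam = (grad_f0(X) ** 2).mean(axis=0)            # lambda_j = E[(df_0/dX_j)^2]
    print(lam)  # approx [1/3, 4]; a purely linear term beta*x_j contributes beta^2

As in the linear-model intuition above, a linear coordinate contributes its squared coefficient, while nonlinear coordinates contribute a weighted average of the squared derivative.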
Neural network
We study the case where the unknown regression function f_0 is modeled by a single-layer feedforward NN.
A single-layer NN f is specified by a bounded activation function ψ on R and the number of hidden units K:
f(x) = b_0 + Σ_{k=1}^K b_k ψ(a_{0,k} + a_k^⊤ x)
where b_0, b_k, a_{0,k} ∈ R and a_k ∈ R^d are to be estimated.
Functions of the form f are dense in C(𝒳) (they are universal approximators): choosing K large enough, f can approximate f_0 to any given precision.
9 / 27
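A minimal numpy sketch of the single-layer network defined above, with randomly drawn (untrained) parameters; the sigmoid choice for ψ and the parameter shapes are the only assumptions made here.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def nn(x, b0, b, A0, A):
        # f(x) = b_0 + sum_k b_k * psi(a_{0,k} + a_k' x), with psi = sigmoid
        # x: (d,), b0: scalar, b: (K,), A0: (K,), A: (K, d)
        return b0 + b @ sigmoid(A0 + A @ x)

    d, K = 4, 3
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=d)
    value = nn(x, 0.0, rng.normal(size=K), rng.normal(size=K), rng.normal(size=(K, d)))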
Neural network: d = 4 features, K = 3 hidden units [figure: network architecture diagram omitted]
10 / 27
Sieve estimator of neural network
We use n i.i.d. samples (Y_i, X_i) to construct a sieve M-estimator f_n for which K = K_n increases with n.
We assume f_0 ∈ Θ = class of C^1 functions on the d-hypercube 𝒳 with uniformly bounded Sobolev norm.
Sieve subsets Θ_n ⊆ Θ are generated by NNs f with K_n hidden units, bounded L^1 norms of the weights, and sigmoid ψ.
The sieve M-estimator f_n is the approximate maximizer over Θ_n of the empirical criterion function L_n(g) = (1/n) Σ_{i=1}^n l(Y_i, X_i, g), where l : R × 𝒳 × Θ → R:
L_n(f_n) ≥ sup_{g ∈ Θ_n} L_n(g) − o_P(1)
11 / 27
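In practice, with the squared-error loss used below, approximately maximizing L_n amounts to minimizing the mean squared error. A hedged TensorFlow/Keras sketch of such a fit follows; the toy data, optimizer, and training schedule are illustrative assumptions, and the theoretical L^1 bound on the weights is not enforced here.

    import numpy as np
    import tensorflow as tf

    # Toy data standing in for the n i.i.d. samples (Y_i, X_i); d and K_n are illustrative.
    rng = np.random.default_rng(0)
    d, n = 8, 10_000
    X_train = rng.uniform(-1.0, 1.0, size=(n, d)).astype("float32")
    Y_train = (X_train[:, 0] ** 2 + 0.1 * X_train[:, 6]).astype("float32")

    K_n = 25   # number of hidden units; in the theory K_n grows with n
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(K_n, activation="sigmoid", input_shape=(d,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")   # minimizing MSE = approximately maximizing L_n
    model.fit(X_train, Y_train, epochs=20, batch_size=256, verbose=0)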
Neural network test statistic
The NN test statistic is given by
λ_n^j = ∫_𝒳 (∂f_n(x)/∂x_j)² dµ(x) = φ[f_n]
We will use the asymptotic (n → ∞) distribution of λ_n^j for testing the null, since a bootstrap approach would typically be too computationally expensive. Two ingredients:
1. Asymptotic distribution of f_n
2. Functional delta method
In the large-n regime, due to the universal approximation property, we are actually performing inference on the "ground truth" f_0 (model-free inference).
12 / 27
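With µ = P, the statistic can be estimated by averaging squared partial derivatives of the fitted network over samples, which automatic differentiation delivers directly. A sketch follows; the untrained stand-in model and the random test inputs are placeholders for the fitted f_n and held-out data.

    import numpy as np
    import tensorflow as tf

    # Stand-in for a fitted single-layer Keras model f_n (e.g., from the previous sketch).
    d = 8
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(25, activation="sigmoid", input_shape=(d,)),
        tf.keras.layers.Dense(1),
    ])
    X_test = tf.constant(np.random.default_rng(0).uniform(-1, 1, (5_000, d)), dtype=tf.float32)

    with tf.GradientTape() as tape:
        tape.watch(X_test)
        preds = model(X_test)
    grads = tape.gradient(preds, X_test)           # (n, d) matrix of partial derivatives df_n/dx_j
    lam_n = tf.reduce_mean(grads ** 2, axis=0)     # lambda_n^j for each j, with mu = P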
Asymptotic distribution of NN estimator
Theorem. Assume that dP = ν dλ for a bounded and strictly positive density ν, that the dimension K_n of the NN satisfies K_n^{2+1/d} log K_n = O(n), and that the loss function is l(y, x, g) = −(1/2)(y − g(x))². Then
r_n (f_n − f_0) ⇒ h⋆ in (Θ, L²(P))
where r_n = (n / log n)^{(d+1)/(2(2d+1))} and h⋆ is the argmax of the Gaussian process {G_f : f ∈ Θ} with mean zero and Cov(G_s, G_t) = 4σ² E(s(X) t(X)).
13 / 27
Comments
r_n is the estimation rate of the NN (Chen and Shen (1998)): E_X[(f_n(X) − f_0(X))²] = O_P(r_n^{−2}), assuming the number of hidden units K_n is chosen such that K_n^{2+1/d} log K_n = O(n).
Proof outline:
- Uses empirical process arguments
- The estimation rate implies tightness of h_n = r_n (f_n − f_0)
- The rescaled and shifted criterion function converges weakly to a Gaussian process
- The Gaussian process has a unique maximum at h⋆
- Argmax continuous mapping theorem
14 / 27
Asymptotic distribution of test statistic
Theorem. Under the conditions of Theorem 1 and the null hypothesis,
r_n² λ_n^j ⇒ ∫_𝒳 (∂h⋆(x)/∂x_j)² dµ(x)
15 / 27
Empirical test statistic
Theorem. Assume µ = P, so that the test statistic is λ_n^j = E_X[(∂f_n(X)/∂x_j)²]. Under the conditions of Theorem 1 and the null hypothesis, the empirical test statistic satisfies
r_n² n^{−1} Σ_{i=1}^n (∂f_n(X_i)/∂x_j)² ⇒ E_X[(∂h⋆(X)/∂x_j)²]
16 / 27
Identifying the asymptotic distribution
Theorem. Take µ = P. Let {φ_i} be an orthonormal basis of Θ. If that basis is C^1 and stable under differentiation, then
E_X[(∂h⋆(X)/∂x_j)²] =_d B² · ( Σ_{i=0}^∞ (α_{i,j}²/d_i⁴) χ_i² ) / ( Σ_{i=0}^∞ χ_i²/d_i² )
where B is the bound on the Sobolev norm over Θ, {χ_i²} are i.i.d. samples from the chi-square distribution, α_{i,j} ∈ R satisfies ∂φ_i/∂x_j = α_{i,j} φ_{k(i)} for some k : N → N, and the d_i's are certain functions of the α_{i,j}'s.
17 / 27
Implementing the test
- Truncate the infinite sum at some finite order N
- Draw samples from the χ² distribution to construct a sample of the approximate limiting law
- Repeat m times and compute the empirical quantile Q_{N,m} at level α ∈ (0, 1) of the corresponding samples
- If m = m_N → ∞ as N → ∞, then Q_{N,m_N} is a consistent estimator of the true quantile of interest
- Reject H_0 if λ_n^j > Q_{N,m_N}(1 − α), so that the test is asymptotically of level α:
P_{H_0}( λ_n^j > Q_{N,m_N}(1 − α) ) ≤ α
18 / 27
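A schematic numpy sketch of this Monte Carlo procedure; the basis coefficients α_{i,j}, the weights d_i, the Sobolev bound B, the chi-square degrees of freedom, and the exact form of the limiting draw follow the (reconstructed) expression on the previous slide and are placeholders, not the paper's prescription.

    import numpy as np

    def limit_draws(alpha_j, d_wts, B, m, rng):
        # m draws from the truncated (at N = len(alpha_j)) limiting law for
        # E_X[(dh*/dx_j)^2], written as a ratio of weighted chi-square sums.
        N = len(alpha_j)
        chi2 = rng.chisquare(df=1, size=(m, N))        # df = 1 is an assumption
        num = (alpha_j**2 / d_wts**4 * chi2).sum(axis=1)
        den = (chi2 / d_wts**2).sum(axis=1)
        return B**2 * num / den

    rng = np.random.default_rng(0)
    alpha_j = np.ones(50)                              # placeholder basis coefficients
    d_wts = 1.0 + np.arange(50.0)                      # placeholder d_i weights
    B = 10.0                                           # placeholder Sobolev bound
    draws = limit_draws(alpha_j, d_wts, B, m=100_000, rng=rng)
    Q = np.quantile(draws, 0.95)                       # empirical (1 - alpha) quantile, alpha = 0.05
    # Reject H_0 when the scaled statistic r_n^2 * lambda_n^j exceeds Q.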
Simulation study
8 variables: X = (X_1, ..., X_8) ∼ U(−1, 1)^8
Ground truth: Y = 8 + X_1² + X_2 X_3 + cos(X_4) + exp(X_5 X_6) + 0.1 X_7 + ε, where ε ∼ N(0, 0.01²) and X_8 has no influence on Y
Training (via TensorFlow): 100,000 samples (Y_i, X_i); Validation, Testing: 10,000 samples each
Out-of-sample MSE:
Model             | Mean Squared Error
NN with K = 25    | 3.1 · 10^−4 ∼ Var(ε)
Linear Regression | 0.35
19 / 27
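A short sketch that reproduces this simulation design (the random seed is the only choice made here):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    X = rng.uniform(-1.0, 1.0, size=(n, 8))
    eps = rng.normal(0.0, 0.01, size=n)
    Y = (8 + X[:, 0]**2 + X[:, 1]*X[:, 2] + np.cos(X[:, 3])
         + np.exp(X[:, 4]*X[:, 5]) + 0.1*X[:, 6] + eps)   # X_8 (column 7) never enters Y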
Linear model fails to identify significant variables
Variable | coef    | std err | t        | P>|t|
const    | 10.2297 | 0.002   | 5459.250 | 0.000
X_1      | -0.0031 | 0.003   | -0.964   | 0.335
X_2      |  0.0051 | 0.003   |  1.561   | 0.118
X_3      | -0.0026 | 0.003   | -0.800   | 0.424
X_4      |  0.0003 | 0.003   |  0.085   | 0.932
X_5      |  0.0016 | 0.003   |  0.493   | 0.622
X_6      | -0.0033 | 0.003   | -1.035   | 0.300
X_7      |  0.0976 | 0.003   | 30.059   | 0.000
X_8      | -0.0018 | 0.003   | -0.563   | 0.573
Only the intercept and the linear term 0.1 X_7 are identified as significant. The irrelevant X_8 is correctly identified as insignificant.
20 / 27
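The linear benchmark in this table can be reproduced with an ordinary least-squares fit along the following lines, assuming X and Y from the simulation sketch above; statsmodels is one convenient choice, not the tool named in the slides.

    import statsmodels.api as sm

    # X, Y come from the simulation sketch on the previous slide
    ols = sm.OLS(Y, sm.add_constant(X)).fit()
    print(ols.summary())   # t-tests flag only the intercept and X_7 (the linear term)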