import streamlit as st


def about():
    _, centercol, _ = st.columns([1, 3, 1])
    with centercol:
        st.markdown(
            """
            ## Testing Semantic Importance via Betting

            We briefly present here the main ideas and contributions.
        """
        )

        st.markdown("""### 1. Setup""")
        st.image(
            "./assets/about/setup.jpg",
            caption="Figure 1: Pictorial representation of the setup.",
            use_column_width=True,
        )

        st.markdown(
            """
            We consider classification problems with:
                    
            * **Input image** $X \in \mathcal{X}$.
            * **Feature encoder** $f:~\mathcal{X} \\to \mathbb{R}^d$ that maps input 
            images to dense embeddings $H = f(X) \in \mathbb{R}^d$.
            * **Classifier** $g:~\mathbb{R}^d \\to [0,1]^k$ that separates embeddings 
            into one of $k$ classes. We do not assume $g$ has a particular form; it can be
            any fixed, potentially nonlinear function.
            * **Concept bank** $c = [c_1, \dots, c_m] \in \mathbb{R}^{d \\times m}$ such 
            that $c_j \in \mathbb{R}^d$ is the representation of the $j^{\\text{th}}$ concept.
            We assume that $c$ is user-defined and that $m$ is small ($m \\approx 20$). 
            * **Semantics** $Z = [Z_1, \dots, Z_m] = c^{\\top} H$ where $Z_j \in [-1, 1]$ represents the 
            amount of concept $j$ present in the dense embedding of input image $X$. 

            For example:

            * $f$ is the image encoder of a vision-language model (e.g., CLIP$^1$, OpenCLIP$^2$).
            * $g$ is the zero-shot classifier obtained by encoding *"A photo of a <CLASS_NAME>"* with the
            text encoder of the same vision-language model.
            * $c$ is obtained similarly by encoding the user-defined concepts.
            """
        )
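        st.markdown(
            """
            *The following NumPy sketch (not part of the paper) illustrates this setup with random
            placeholder vectors standing in for the vision-language components: in practice $H$,
            $c$, and the zero-shot classifier would come from the pretrained CLIP/OpenCLIP encoders.*
            """
        )
        st.code(
            """
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 512, 20, 10  # embedding dimension, number of concepts, number of classes

# Placeholders: in practice H = f(X) comes from the image encoder, while the
# class prompts W and the concept bank C come from the text encoder.
H = rng.normal(size=d)
H /= np.linalg.norm(H)            # dense embedding H = f(X), unit norm

C = rng.normal(size=(d, m))
C /= np.linalg.norm(C, axis=0)    # concept bank c = [c_1, ..., c_m]

W = rng.normal(size=(d, k))
W /= np.linalg.norm(W, axis=0)    # encoded class prompts ("A photo of a ...")

Z = C.T @ H                       # semantics Z in [-1, 1]^m

def g(h, temperature=0.01):
    # Zero-shot classifier: softmax over cosine similarities with the class prompts.
    logits = W.T @ h / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()

y_hat = g(H)                      # predicted class probabilities in [0, 1]^k
            """.strip(),
            language="python",
        )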

        st.markdown(
            """
            ### 2. Defining Semantic Importance

            Our goal is to test the statistical importance of the concepts in $c$ for the 
            predictions of the given classifier on a particular image $x$ (capital letters denote random 
            variables, and lowercase letters their realizations). 
            
            We do not train a surrogate, interpretable model and instead consider the original, potentially 
            nonlinear classifier $g$. This is because we want to study the semantic importance of
            the model that would actually be deployed in real-world settings, not of a surrogate
            that might degrade performance.

            We define importance from the perspective of conditional independence testing because
            it allows for rigorous statistical testing with false positive rate control
            (i.e., Type I error control). That is, the probability of falsely deeming a concept
            important is below a user-defined level $\\alpha \in (0,1)$.

            For an image $x$, a concept $j$, and a subset $S \subseteq [m] \setminus \{j\}$ (i.e., any 
            subset that does not contain $j$), we define the null hypothesis:

            $$
                H_0:~\hat{Y}_{S \cup \{j\}} \overset{d}{=} \hat{Y}_S,
            $$
            where $\overset{d}{=}$ denotes equality in distribution and, $\\forall C \subseteq [m]$,
            $\hat{Y}_C = g(\widetilde{H}_C)$ with $\widetilde{H}_C \sim P_{H \mid Z_C = z_C}$, the conditional distribution of the dense
            embeddings given the observed concepts $z_C$, i.e., the semantics of $x$.
            Rejecting $H_0$ then means that concept $j$ affects the distribution of the model's
            response, i.e., that it is important.
            """
        )
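        st.markdown(
            """
            *To build intuition for what rejecting $H_0$ detects, the toy sketch below (not from the
            paper) collapses the embedding and its semantics into one vector and lets a toy model act
            directly on it: conditioning on an observed concept value changes the response distribution
            only if the model actually uses that concept.*
            """
        )
        st.code(
            """
import numpy as np

rng = np.random.default_rng(1)

# Toy model: the response depends on semantic coordinate 0 only; coordinate 1 is irrelevant.
def response(z):
    return 1.0 / (1.0 + np.exp(-4.0 * z[0]))

z_obs = np.array([0.8, -0.3])     # observed semantics of one image
n = 5000

def y_hat(conditioned_on):
    # Resample the semantics, keeping the coordinates in `conditioned_on` fixed at their
    # observed values, and push each sample through the (unchanged) model.
    out = []
    for _ in range(n):
        z = rng.uniform(-1, 1, size=2)
        z[conditioned_on] = z_obs[conditioned_on]
        out.append(response(z))
    return np.array(out)

# Conditioning on concept 0 shifts the response distribution -> concept 0 is important.
print(y_hat([0]).mean(), y_hat([]).mean())
# Conditioning on concept 1 leaves it unchanged -> H_0 holds, concept 1 is not important.
print(y_hat([1]).mean(), y_hat([]).mean())
            """.strip(),
            language="python",
        )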

        st.markdown(
            """
            ### 3. Sampling Conditional Embeddings
            """
        )
        st.image(
            "./assets/about/local_dist.jpg",
            caption=(
                "Figure 2: Example test (i.e., with concept) and null (i.e., without"
                " concept) distributions for a class-specific concept and a non-class"
                " specific one on three images from the Imagenette dataset as a"
                " function of the size of S."
            ),
            use_column_width=True,
        )
        st.markdown(
            """
            In order to test for $H_0$ defined above, we need to sample from the conditional distribution
            of the dense embeddings given certain concepts. This can be seen as solving a linear inverse
            problem stochastically since $Z = c^{\\top} H$. In this work, given that $m$ is small, we use
            nonparametric kernel density estimation (KDE) methods to approximate the target distribution. 
            
            Intuitively, given a dataset $\{(h^{(i)}, z^{(i)})\}_{i=1}^n$ of dense embeddings with
            their semantics, we:

            1. Use a weighted KDE to sample $\widetilde{Z} \sim P_{Z \mid Z_C = z_C}$, and then
            2. Retrieve the embedding $H^{(i')}$ whose concept representation $Z^{(i')}$ is the
            nearest neighbor of $\widetilde{Z}$ in the dataset.

            Details on the weighted KDE and the sampling procedure are included in the paper. Figure 2
            shows some example test (i.e., $\hat{Y}_{S \cup \{j\}}$) and 
            null (i.e., $\hat{Y}_{S}$) distributions for a class-specific concept and a non-class-specific
            one on three images from the Imagenette$^3$ dataset. We can see that the test
            distributions of class-specific concepts are skewed to the right, i.e., including the observed
            class-specific concept increases the output of the predictor. Furthermore, the shift
            decreases as more concepts are included in $S$: if $S$ is larger and contains more
            information, the marginal contribution of adding one concept is smaller.
            On the other hand, including a non-class-specific concept does not change the distribution
            of the response of the model, no matter the size of $S$.
            """
        )
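        st.markdown(
            """
            *As a rough illustration, here is one way such a sampler could look, assuming Gaussian
            kernels with a single bandwidth and a plain nearest-neighbor lookup; the exact weighting
            scheme and bandwidth selection used in the paper may differ.*
            """
        )
        st.code(
            """
import numpy as np

def sample_h_given_z(H_data, Z_data, C_idx, z_obs, n_samples=100, bandwidth=0.1, rng=None):
    # H_data: (n, d) dense embeddings; Z_data: (n, m) their semantics.
    # C_idx: indices of the conditioned concepts; z_obs: observed semantics (m,).
    rng = rng if rng is not None else np.random.default_rng()
    n, m = Z_data.shape

    # 1. Weighted KDE on the semantics: each data point is a Gaussian component,
    #    weighted by how well its conditioned coordinates match z_obs[C_idx].
    diff = Z_data[:, C_idx] - z_obs[C_idx]
    w = np.exp(-0.5 * np.sum(diff**2, axis=1) / bandwidth**2)
    w /= w.sum()

    samples = []
    for _ in range(n_samples):
        i = rng.choice(n, p=w)                         # pick a mixture component
        z_tilde = Z_data[i] + bandwidth * rng.normal(size=m)
        z_tilde[C_idx] = z_obs[C_idx]                  # keep conditioned concepts fixed

        # 2. Retrieve the embedding whose semantics are the nearest neighbor of z_tilde.
        j = np.argmin(np.sum((Z_data - z_tilde) ** 2, axis=1))
        samples.append(H_data[j])
    return np.array(samples)

# Test and null responses are then obtained by pushing the samples through g, e.g.
# y_test = np.array([g(h)[class_idx] for h in sample_h_given_z(H_data, Z_data, S + [j], z_obs)])
# y_null = np.array([g(h)[class_idx] for h in sample_h_given_z(H_data, Z_data, S, z_obs)])
            """.strip(),
            language="python",
        )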

        st.markdown(
            """
            ### 4. Testing by Betting

            Instead of classical hypothesis testing techniques based on $p$-values, we propose to
            test for the importance of concepts by *betting*.$^4$ This choice is motivated by two important
            properties of sequential tests:

            1. They are **adaptive** to the hardness of the problem. That is, the easier it is to reject
            a null hypothesis, the earlier the test will stop. This induces a natural ranking of importance
            across concepts: if concept $j$ rejects faster than $j'$, then $j$ is more important than $j'$.

            2. They are **efficient** because they only use as much data as needed to reject, instead of
            all the available data, as traditional offline tests do.

            Sequential tests instantiate a game between a *bettor* and *nature*. At every turn of the game,
            the bettor places a wager against the null hypothesis, and nature reveals the truth. If
            the bettor wins, they accumulate wealth; otherwise, they lose some. More formally, the
            *wealth process* $\{K_t\}_{t \in \mathbb{N}_0}$ is defined as

            $$
                K_0 = 1, \\quad K_{t+1} = K_t \cdot (1 + v_t\kappa_t),
            $$
            where $v_t \in [-1,1]$ is a betting fraction, and $\kappa_t \in [-1,1]$ is the payoff of the bet.
            Under certain conditions, the wealth process describes a *fair game*, and for $\\alpha \in (0,1)$,
            it holds that

            $$
                \mathbb{P}_{H_0}[\exists t:~K_t \geq 1/\\alpha] \leq \\alpha.
            $$

            That is, the wealth process can be used to reject the null hypothesis $H_0$ with
            Type I error control at level $\\alpha$.

            Briefly, we use ideas of sequential kernelized independence testing (SKIT)$^5$ and define 
            the payoff as

            $$
                \kappa_t \coloneqq \\tanh\left(\\rho_t(\hat{Y}_{S \cup \{j\}}) - \\rho_t(\hat{Y}_S)\\right)
            $$
            where
            $$
                \\rho_t = \widehat{\\text{MMD}}(\hat{Y}_{S \cup \{j\}}, \hat{Y}_S)
            $$
            is the plug-in estimator of the maximum mean discrepancy (MMD)$^6$ between the test and
            null distributions at time $t$. Furthermore, we use the online Newton step (ONS)$^7$ method
            to choose the betting fraction $v_t$ and ensure exponential growth of the wealth.
            """
        )
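        st.markdown(
            """
            *The sketch below (not from the paper) illustrates the wealth process with a tanh payoff
            built from an empirical MMD witness function and an ONS-style update of the betting
            fraction; the constants, normalization, and projection follow one common variant of these
            methods and are not necessarily those used in our experiments.*
            """
        )
        st.code(
            """
import numpy as np

def gauss_kernel(a, b, sigma=0.5):
    return np.exp(-0.5 * (a - b) ** 2 / sigma**2)

def betting_test(y_test, y_null, alpha=0.05):
    # y_test, y_null: streams of responses sampled with and without concept j.
    wealth, v, a_ons = 1.0, 0.0, 1.0
    n = min(len(y_test), len(y_null))
    for t in range(1, n):
        past_test, past_null = y_test[:t], y_null[:t]

        # Empirical MMD witness: positive where the test distribution has more
        # mass than the null distribution (estimated from past samples only).
        def witness(y):
            return gauss_kernel(y, past_test).mean() - gauss_kernel(y, past_null).mean()

        # Payoff in [-1, 1]: large when the new test point falls where the test
        # distribution dominates and the new null point where it does not.
        kappa = np.tanh(witness(y_test[t]) - witness(y_null[t]))

        wealth *= 1.0 + v * kappa            # wealth process K_t
        if wealth >= 1.0 / alpha:            # reject as soon as K_t >= 1 / alpha
            return True, t, wealth

        # ONS-style update of the betting fraction v_t.
        z = kappa / (1.0 + v * kappa)
        a_ons += z**2
        v = np.clip(v + 2.0 / (2.0 - np.log(3.0)) * z / a_ons, -0.5, 0.5)
    return False, n, wealth
            """.strip(),
            language="python",
        )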

        st.markdown(
            """
            ---

            **References**

            [1] CLIP is available at https://github.com/openai/CLIP .

            [2] OpenCLIP is available at https://github.com/mlfoundations/open_clip .

            [3] The Imagenette dataset is available at https://github.com/fastai/imagenette .
            
            [4] Glenn Shafer. Testing by betting: A strategy for statistical and scientific communication.
            Journal of the Royal Statistical Society Series A: Statistics in Society, 184(2):407-431, 2021.

            [5] Aleksandr Podkopaev et al. Sequential kernelized independence testing. In International 
            Conference on Machine Learning, pages 27957-27993. PMLR, 2023.

            [6] Arthur Gretton et al. A kernel two-sample test. The Journal of Machine Learning Research,
            13(1):723-773, 2012.

            [7] Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online 
            learning in banach spaces. In Conference On Learning Theory, pages 1493-1529. PMLR, 2018.
            """
        )