import streamlit as st
def about():
_, centercol, _ = st.columns([1, 3, 1])
with centercol:
st.markdown(
"""
## Testing Semantic Importance via Betting
We briefly present here the main ideas and contributions.
"""
)
st.markdown("""### 1. Setup""")
st.image(
"./assets/about/setup.jpg",
caption="Figure 1: Pictorial representation of the setup.",
use_column_width=True,
)
st.markdown(
"""
We consider classification problems with:
* **Input image** $X \in \mathcal{X}$.
* **Feature encoder** $f:~\mathcal{X} \\to \mathbb{R}^d$ that maps input
images to dense embeddings $H = f(X) \in \mathbb{R}^d$.
* **Classifier** $g:~\mathbb{R}^d \\to [0,1]^k$ that separates embeddings
into one of $k$ classes. We do not assume $g$ has a particular form; it
can be any fixed, potentially nonlinear function.
* **Concept bank** $c = [c_1, \dots, c_m] \in \mathbb{R}^{d \\times m}$ such
that $c_j \in \mathbb{R}^d$ is the representation of the $j^{\\text{th}}$ concept.
We assume that $c$ is user-defined and that $m$ is small ($m \\approx 20$).
* **Semantics** $Z = [Z_1, \dots, Z_m] = c^{\\top} H$ where $Z_j \in [-1, 1]$ represents the
amount of concept $j$ present in the dense embedding of input image $X$.
For example:
* $f$ is the image encoder of a vision-language model (e.g., CLIP$^1$, OpenCLIP$^2$).
* $g$ is the zero-shot classifier obtained by encoding *"A photo of a <CLASS_NAME>"* with the
text encoder of the same vision-language model.
* $c$ is obtained similarly by encoding the user-defined concepts.
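
For illustration, a minimal sketch of this setup (assuming the `clip` package$^1$; the image path,
concept names, and class names below are hypothetical placeholders) could look as follows:

```python
import clip
import torch
from PIL import Image

# Feature encoder f: the image encoder of a vision-language model.
model, preprocess = clip.load("ViT-B/32", device="cpu")

# Hypothetical user-defined concepts and class names.
concepts = ["stripes", "pointy ears", "metallic surface"]
classes = ["cat", "dog", "car"]

with torch.no_grad():
    # Dense embedding H = f(X), normalized to unit norm.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    h = model.encode_image(image).float()
    h = h / h.norm(dim=-1, keepdim=True)

    # Concept bank c: one (normalized) text embedding per concept.
    c = model.encode_text(clip.tokenize(concepts)).float()
    c = c / c.norm(dim=-1, keepdim=True)

    # Zero-shot classifier g from class-name prompts.
    w = model.encode_text(clip.tokenize([f"A photo of a {name}" for name in classes])).float()
    w = w / w.norm(dim=-1, keepdim=True)

# Semantics Z = c^T H, with each entry in [-1, 1] since all embeddings are unit norm.
z = h @ c.T
# Zero-shot class scores (a softmax over these gives the [0, 1]^k output of g).
logits = h @ w.T
```
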
"""
)
st.markdown(
"""
### 2. Defining Semantic Importance
Our goal is to test the statistical importance of the concepts in $c$ for the
predictions of the given classifier on a particular image $x$ (capital letters denote random
variables, and lowercase letters their realizations).
We do not train a surrogate, interpretable model and instead consider the original, potentially
nonlinear classifier $g$. This is because we want to study the semantic importance of
the model that would be deployed in real-world settings, not that of a surrogate that
might degrade performance.
We define importance from the perspective of conditional independence testing because
it allows for rigorous statistical testing with false positive rate control
(i.e., Type I error control). That is, the probability of falsely deeming a concept
important is below a user-defined level $\\alpha \in (0,1)$.
For an image $x$, a concept $j$, and a subset $S \subseteq [m] \setminus \{j\}$ (i.e., any
subset that does not contain $j$), we define the null hypothesis:
$$
H_0:~\hat{Y}_{S \cup \{j\}} \overset{d}{=} \hat{Y}_S,
$$
where $\overset{d}{=}$ denotes equality in distribution and, $\\forall C \subseteq [m]$,
$\hat{Y}_C = g(\widetilde{H}_C)$ with $\widetilde{H}_C \sim P_{H \mid Z_C = z_C}$, the conditional distribution of the dense
embeddings given the observed concepts $z_C$, i.e., the semantics of $x$.
Then, rejecting $H_0$ means the concept $j$ affects the distribution of the response of
the model, and it is important.
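
For instance, a minimal sketch of how the two prediction distributions compared by $H_0$ could
be formed (assuming a hypothetical conditional sampler `sample_embedding_given` for
$P_{H \mid Z_C = z_C}$, described in the next section) is:

```python
import numpy as np

# Draw n_samples of Y_hat_C = g(H_tilde_C), with H_tilde_C ~ P_{H | Z_C = z_C}.
# `g` and `sample_embedding_given` are placeholders for the classifier and the
# conditional sampler sketched in the next section.
def prediction_distribution(g, sample_embedding_given, z, C, n_samples=100):
    return np.stack([g(sample_embedding_given(z, C)) for _ in range(n_samples)])

# Test distribution (with concept j) vs. null distribution (without it):
# y_test = prediction_distribution(g, sample_embedding_given, z, S | {j})
# y_null = prediction_distribution(g, sample_embedding_given, z, S)
# Rejecting H_0 means these two distributions differ, i.e., concept j is important.
```
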
"""
)
st.markdown(
"""
### 3. Sampling Conditional Embeddings
"""
)
st.image(
"./assets/about/local_dist.jpg",
caption=(
"Figure 2: Example test (i.e., with concept) and null (i.e., without"
" concept) distributions for a class-specific concept and a non-class"
" specific one on three images from the Imagenette dataset as a"
" function of the size of S."
),
use_column_width=True,
)
st.markdown(
"""
In order to test for $H_0$ defined above, we need to sample from the conditional distribution
of the dense embeddings given certain concepts. This can be seen as solving a linear inverse
problem stochastically since $Z = c^{\\top} H$. In this work, given that $m$ is small, we use
nonparametric kernel density estimation (KDE) methods to approximate the target distribution.
Intuitively, given a dataset $\{(h^{(i)}, z^{(i)})\}_{i=1}^n$ of dense embeddings with
their semantics, we:
1. Use a weighted KDE to sample $\widetilde{Z} \sim P_{Z \mid Z_C = z_C}$, and then
2. Retrieve the embedding $H^{(i')}$ whose concept representation $Z^{(i')}$ is the
nearest neighbor of $\widetilde{Z}$ in the dataset.
Details on the weighted KDE and the sampling procedure are included in the paper. Figure 2
shows some example test (i.e., $\hat{Y}_{S \cup \{j\}}$) and
null (i.e., $\hat{Y}_{S}$) distributions for a class-specific concept and a non-class
specific one on three images from the Imagenette$^3$ dataset. We can see that the test
distributions of class-specific concepts are skewed to the right, i.e. including the observed
class-specific concept increases the output of the predictor. Furthermore, we see the shift
decreases the more concepts are included in $S$, i.e., the larger $S$ is and the more
information it contains, the smaller the marginal contribution of adding one concept.
On the other hand, including a non-class-specific concept does not change the distribution
of the response of the model, no matter the size of $S$.
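
A simplified sketch of the two-step sampling procedure above (assuming a Gaussian kernel with a
fixed bandwidth; the weighted KDE used in the paper is more involved) could be:

```python
import numpy as np

# h_data: (n, d) dense embeddings, z_data: (n, m) their semantics,
# z_obs: (m,) semantics of the image at hand, C: indices of the conditioned concepts.
def sample_conditional_embedding(h_data, z_data, z_obs, C, bandwidth=0.1, seed=None):
    rng = np.random.default_rng(seed)

    # Weight each point in the dataset by a Gaussian kernel on the conditioned coordinates Z_C.
    diff = z_data[:, C] - z_obs[C]
    weights = np.exp(-0.5 * np.sum((diff / bandwidth) ** 2, axis=1))
    weights /= weights.sum()

    # Step 1: draw a kernel center according to the weights and perturb it (weighted KDE sample),
    # keeping the conditioned coordinates at their observed values.
    i = rng.choice(len(z_data), p=weights)
    z_tilde = z_data[i] + bandwidth * rng.standard_normal(z_data.shape[1])
    z_tilde[C] = z_obs[C]

    # Step 2: retrieve the embedding whose semantics are the nearest neighbor of z_tilde.
    i_star = np.argmin(np.linalg.norm(z_data - z_tilde, axis=1))
    return h_data[i_star]
```
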
"""
)
st.markdown(
"""
### 4. Testing by Betting
Instead of classical hypothesis testing techniques based on $p$-values, we propose to
test for the importance of concepts by *betting*.$^4$ This choice is motivated by two important
properties of sequential tests:
1. They are **adaptive** to the hardness of the problem. That is, the easier it is to reject
a null hypothesis, the earlier the test will stop. This induces a natural ranking of importance
across concepts: if concept $j$ rejects faster than $j'$, then $j$ is more important than $j'$.
2. They are **efficient** because they only use as much data as needed to reject, instead of
all of the available data, as traditional offline tests do.
Sequential tests instantiate a game between a *bettor* and *nature*. At every turn of the game,
the bettor places a wager against the null hypothesis, and nature reveals the truth. If
the bettor wins, they accumulate wealth; otherwise, they lose some. More formally, the
*wealth process* $\{K_t\}_{t \in \mathbb{N}_0}$ is defined as
$$
K_0 = 1, \\quad K_{t+1} = K_t \cdot (1 + v_t\kappa_t),
$$
where $v_t \in [-1,1]$ is a betting fraction, and $\kappa_t \in [-1,1]$ is the payoff of the bet.
Under certain conditions, the wealth process describes a *fair game*, and for $\\alpha \in (0,1)$,
it holds that
$$
\mathbb{P}_{H_0}[\exists t:~K_t \geq 1/\\alpha] \leq \\alpha.
$$
That is, the wealth process can be used to reject the null hypothesis $H_0$ with
Type I error control at level $\\alpha$.
Briefly, we use ideas from sequential kernelized independence testing (SKIT)$^5$ and define
the payoff as
$$
\kappa_t \coloneqq \\tanh\left(\\rho_t(\hat{Y}_{S \cup \{j\}}) - \\rho_t(\hat{Y}_S)\\right),
$$
where $\\rho_t$ is the plug-in estimate of the witness function of the maximum mean
discrepancy (MMD)$^6$ between the test and null distributions at time $t$. Furthermore, we use
the online Newton step (ONS)$^7$ method to choose the betting fraction $v_t$ and ensure
exponential growth of the wealth.
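
A simplified sketch of the resulting sequential test (assuming the payoffs $\kappa_t$ have
already been computed; the clipping range of the ONS update follows common practice and is an
assumption here) could be:

```python
import numpy as np

# Sequential test by betting: returns the first time the wealth crosses 1/alpha
# (reject H_0), or None if it never does.
def betting_test(payoffs, alpha=0.05):
    wealth = 1.0
    v = 0.0                                # betting fraction v_t
    a = 1.0                                # running normalizer for the ONS step size
    ons_const = 2.0 / (2.0 - np.log(3.0))

    for t, kappa in enumerate(payoffs, start=1):
        wealth *= 1.0 + v * kappa          # wealth update K_t = K_{t-1} * (1 + v_t * kappa_t)
        if wealth >= 1.0 / alpha:          # reject H_0 with Type I error control at level alpha
            return t

        # Online Newton step (ONS) update of the betting fraction.
        grad = kappa / (1.0 + v * kappa)   # gradient of log(1 + v * kappa) w.r.t. v
        a += grad ** 2
        v = float(np.clip(v + ons_const * grad / a, -0.5, 0.5))

    return None
```
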
"""
)
st.markdown(
"""
---
**References**
[1] CLIP is available at https://github.com/openai/CLIP .
[2] OpenCLIP is available at https://github.com/mlfoundations/open_clip .
[3] The Imagenette dataset is available at https://github.com/fastai/imagenette .
[4] Glenn Shafer. Testing by betting: A strategy for statistical and scientific communication.
Journal of the Royal Statistical Society Series A: Statistics in Society, 184(2):407-431, 2021.
[5] Aleksandr Podkopaev et al. Sequential kernelized independence testing. In International
Conference on Machine Learning, pages 27957-27993. PMLR, 2023.
[6] Arthur Gretton et al. A kernel two-sample test. The Journal of Machine Learning Research,
13(1):723-773, 2012.
[7] Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online
learning in banach spaces. In Conference On Learning Theory, pages 1493-1529. PMLR, 2018.
"""
)