CONTENTS

2.2.4 Calibration . . . . . 28
2.2.5 Predictive Uncertainty Quantification . . . . . 30
2.2.6 Failure Prediction . . . . . 32
2.3 Document Understanding . . . . . 33
2.3.1 Task Definitions . . . . . 35
2.3.2 Datasets . . . . . 36
2.3.3 Models . . . . . 37
2.3.4 Challenges in Document Understanding . . . . . 38
2.3.4.1 Long-Context Modeling . . . . . 39
2.3.4.2 Document Structure Modeling . . . . . 40
2.4 Intelligent Automation . . . . . 41

I Reliable and Robust Deep Learning . . . . . 43

3 Benchmarking Scalable Predictive Uncertainty in Text Classification . . . . . 44
3.1 Introduction . . . . . 46
3.2 Related Work . . . . . 48
3.3 Uncertainty Methods . . . . . 51
3.3.1 Quantifying Uncertainty in Deep Learning . . . . . 51
3.3.2 Predictive Uncertainty Methods . . . . . 52
3.3.2.1 Monte Carlo Dropout . . . . . 53
3.3.2.2 Deep Ensemble . . . . . 53
3.3.2.3 Concrete Dropout . . . . . 54
3.3.2.4 Heteroscedastic Extensions . . . . . 54
3.3.3 Uncertainty Estimation . . . . . 55
3.3.4 Motivating Hybrid Approaches . . . . . 58
3.3.5 Uncertainty Calibration under Distribution Shift . . . . . 59
3.4 Experimental Methodology . . . . . 61
3.4.1 Proposed Hybrid Approaches . . . . . 61
3.4.2 Datasets . . . . . 63
3.4.3 Architecture . . . . . 64
3.4.4 Evaluation metrics . . . . . 66
3.4.5 Experimental design . . . . . 66
3.4.5.1 In-domain Setting . . . . . 67
3.4.5.2 Cross-domain Setting . . . . . 67
3.4.5.3 Novelty Detection Setting . . . . . 68
3.5 Results . . . . . 69
3.5.1 Experiment: In-domain . . . . . 70
3.5.2 Experiment: Cross-domain . . . . . 71
3.5.3 Experiment: Novelty Detection . . . . . 73
3.5.4 Experiment: Ablations . . . . . 75
3.5.4.1 Diversity . . . . . 76