jordyvl committed
Commit e0a78f5
1 Parent(s): c2e5c18

First commit

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full change set.
Files changed (50)
  1. .gitattributes +291 -0
  2. OCR_directory.sh +17 -0
  3. app.py +114 -67
  4. assets/txts/pg_0002.txt +1 -0
  5. assets/txts/pg_0003.txt +25 -0
  6. assets/txts/pg_0004.txt +10 -0
  7. assets/txts/pg_0005.txt +32 -0
  8. assets/txts/pg_0006.txt +45 -0
  9. assets/txts/pg_0007.txt +33 -0
  10. assets/txts/pg_0008.txt +35 -0
  11. assets/txts/pg_0009.txt +34 -0
  12. assets/txts/pg_0010.txt +44 -0
  13. assets/txts/pg_0013.txt +22 -0
  14. assets/txts/pg_0014.txt +30 -0
  15. assets/txts/pg_0015.txt +30 -0
  16. assets/txts/pg_0016.txt +12 -0
  17. assets/txts/pg_0017.txt +200 -0
  18. assets/txts/pg_0018.txt +204 -0
  19. assets/txts/pg_0019.txt +394 -0
  20. assets/txts/pg_0020.txt +320 -0
  21. assets/txts/pg_0021.txt +421 -0
  22. assets/txts/pg_0033.txt +30 -0
  23. assets/txts/pg_0034.txt +44 -0
  24. assets/txts/pg_0035.txt +46 -0
  25. assets/txts/pg_0036.txt +80 -0
  26. assets/txts/pg_0037.txt +42 -0
  27. assets/txts/pg_0038.txt +45 -0
  28. assets/txts/pg_0039.txt +41 -0
  29. assets/txts/pg_0040.txt +35 -0
  30. assets/txts/pg_0041.txt +27 -0
  31. assets/txts/pg_0042.txt +31 -0
  32. assets/txts/pg_0043.txt +32 -0
  33. assets/txts/pg_0044.txt +163 -0
  34. assets/txts/pg_0045.txt +53 -0
  35. assets/txts/pg_0046.txt +47 -0
  36. assets/txts/pg_0047.txt +53 -0
  37. assets/txts/pg_0048.txt +45 -0
  38. assets/txts/pg_0049.txt +41 -0
  39. assets/txts/pg_0050.txt +46 -0
  40. assets/txts/pg_0051.txt +58 -0
  41. assets/txts/pg_0052.txt +38 -0
  42. assets/txts/pg_0053.txt +46 -0
  43. assets/txts/pg_0054.txt +45 -0
  44. assets/txts/pg_0055.txt +45 -0
  45. assets/txts/pg_0056.txt +39 -0
  46. assets/txts/pg_0057.txt +98 -0
  47. assets/txts/pg_0058.txt +62 -0
  48. assets/txts/pg_0059.txt +64 -0
  49. assets/txts/pg_0060.txt +53 -0
  50. assets/txts/pg_0061.txt +42 -0
.gitattributes CHANGED
@@ -33,3 +33,294 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0031.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0081.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0123.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0155.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0216.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0277.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0015.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0047.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0051.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0054.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0088.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0250.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0009.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0089.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0117.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0241.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0101.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0110.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0208.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0226.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0284.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0060.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0252.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0058.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0099.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0195.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0057.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0105.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0125.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0169.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0184.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0196.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0075.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0236.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0276.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0006.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0156.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0082.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0106.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0157.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0188.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0201.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0225.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0248.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0023.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0116.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0119.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0254.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0278.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0045.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0093.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0182.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0064.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0094.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0104.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0113.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0150.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0189.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0220.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0261.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0011.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0048.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0288.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0034.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0108.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0214.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0287.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0100.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0198.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0227.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0244.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0245.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0270.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0039.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0055.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0086.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0174.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0181.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0266.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0283.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0073.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0080.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0274.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0279.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0036.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0050.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0069.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0053.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0056.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0145.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0027.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0067.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0079.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0013.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0072.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0191.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0263.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0268.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0041.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0136.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0170.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0180.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0200.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0217.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0280.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0016.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0018.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0062.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0122.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0147.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0265.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0215.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0133.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0165.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0166.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0222.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0078.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0171.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0219.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0028.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0107.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0144.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0178.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0190.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0043.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0010.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0021.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0160.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0247.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0063.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0090.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0137.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0159.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0269.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0014.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0026.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0033.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0035.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0046.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0186.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0237.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0179.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0193.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0232.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0109.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0134.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0286.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0003.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0004.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0206.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0251.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0040.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0083.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0230.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0272.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0275.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0096.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0115.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0260.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0271.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0012.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0022.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0176.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0218.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0273.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0065.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0132.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0187.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0267.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0044.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0029.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0084.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0087.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0238.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0253.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0257.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0102.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0103.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0148.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0242.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0258.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0005.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0008.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0032.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0037.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0070.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0207.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0235.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0061.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0068.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0077.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0204.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0239.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0255.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0289.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0025.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0052.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0066.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0131.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0163.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0259.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0224.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0249.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0121.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0140.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0143.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0151.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0095.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0111.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0139.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0211.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0019.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0076.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0152.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0212.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0223.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0017.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0142.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0158.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0233.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0256.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0262.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0282.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0020.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0024.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0199.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0264.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0002.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0092.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0120.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0071.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0074.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0203.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0285.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0085.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0127.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0185.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0281.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0098.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0112.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0141.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0146.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0164.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0240.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0246.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0097.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0149.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0162.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0030.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0049.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0177.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0209.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0213.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0059.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0091.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0129.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0172.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0175.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0183.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0194.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0231.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0001.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0130.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0168.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0202.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0210.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0234.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0038.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0042.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0114.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0124.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0138.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0153.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0154.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0161.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0173.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0221.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0229.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0118.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0126.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0135.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0167.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0192.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0290.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0007.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0128.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0197.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0243.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0205.pdf filter=lfs diff=lfs merge=lfs -text
+ assets/pdfs/pg_0228.pdf filter=lfs diff=lfs merge=lfs -text
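The pattern above — one `filter=lfs diff=lfs merge=lfs -text` rule per tracked path, plus a `*.pdf` wildcard — is the form that `git lfs track` writes into `.gitattributes`. As a sketch only (this helper is hypothetical, not part of the repo), the per-page rules could be generated like this:

```python
def lfs_rule(path: str) -> str:
    """Build one .gitattributes line marking `path` as a Git LFS object."""
    return f"{path} filter=lfs diff=lfs merge=lfs -text"


def rules_for_pages(n_pages: int) -> list:
    # One rule per burst page, mirroring assets/pdfs/pg_0001.pdf .. pg_NNNN.pdf
    return [lfs_rule(f"assets/pdfs/pg_{i:04d}.pdf") for i in range(1, n_pages + 1)]
```

In practice running `git lfs track "*.pdf"` once would cover every page; the per-file entries here are redundant with the wildcard but harmless.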
OCR_directory.sh ADDED
@@ -0,0 +1,17 @@
+ # pdftk thesis.pdf burst
+
+ # using pdftotext, extract text for each page in assets/pdfs and store in assets/txts with a similar basename
+
+ for pdf in assets/pdfs/*.pdf
+ do
+     echo
+     # pdftotext $pdf assets/txts/$(basename $pdf .pdf).txt
+     # pdf2txt.py -o assets/txts/$(basename $pdf .pdf).txt $pdf
+ done
+
+
+ for pdf in assets/pdfs/*.pdf
+ do
+     convert -density 100 -quality 100 -colorspace RGB -alpha remove -alpha off $pdf assets/pngs/$(basename $pdf .pdf).png
+ done
+
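Both loops in the script map each burst page under `assets/pdfs/` to a sibling path in another asset directory by swapping the extension via `basename`. A minimal sketch of that path mapping (the helper name is hypothetical):

```python
import os


def sibling_asset(pdf_path: str, out_dir: str, new_ext: str) -> str:
    """Mirror OCR_directory.sh: assets/pdfs/pg_0001.pdf -> <out_dir>/pg_0001<new_ext>."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]  # basename $pdf .pdf
    return os.path.join(out_dir, stem + new_ext)
```

This is the same convention `app.py` relies on when it reads the extracted text back from `assets/txts`.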
app.py CHANGED
@@ -1,67 +1,114 @@
- import streamlit as st
- from llama_index import VectorStoreIndex
- from llama_index import ServiceContext
- from llama_index.embeddings import HuggingFaceEmbedding
- from llama_index.llms import HuggingFaceInferenceAPI
- from llama_index.schema import Document
- from PyPDF2 import PdfReader
-
- # Streamlit title and description
- st.title("PDF querying using Llama-Index by Rahul Bhoyar")
- st.write("Base Model: **HuggingFaceH4/zephyr-7b-alpha (open-source from HuggingFace)**")
- st.write("Embedding Model: **WhereIsAI/UAE-Large-V1 (open-source from HuggingFace)**")
- st.write("This app allows you to upload your own PDF and query your document.")
-
- hf_token = st.text_input("Enter your Hugging Face token:")
-
-
- def read_pdf(uploaded_file):
-     pdf_reader = PdfReader(uploaded_file)
-     text = ""
-     for page_num in range(len(pdf_reader.pages)):
-         text += pdf_reader.pages[page_num].extract_text()
-     return text
-
-
- # Streamlit input for user file upload
- success = False
- query_engine_creation = False
- uploaded_pdf = st.file_uploader("Upload your PDF", type=['pdf'])
-
- # Load data and configure the index
- if uploaded_pdf is not None:
-     file_contents = read_pdf(uploaded_pdf)
-     documents = Document(text=file_contents)
-     documents = [documents]
-     st.success("Documents loaded successfully!")
-
-     model = st.selectbox('Select the model', ('google/flan-t5-xxl','HuggingFaceH4/zephyr-7b-alpha'), index=0)
-     llm = HuggingFaceInferenceAPI(model_name=model, token=hf_token)
-
-     with st.spinner('Creating Vector Embeddings...'):
-         embed_model_uae = HuggingFaceEmbedding(model_name="WhereIsAI/UAE-Large-V1")
-         service_context = ServiceContext.from_defaults(
-             llm=llm, chunk_size=800, chunk_overlap=20, embed_model=embed_model_uae
-         )
-         index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)
-         index.storage_context.persist()
-         query_engine = index.as_query_engine()
-         query_engine_creation = True
-     # Display the result of the task
-     st.success("Vector embeddings created.")
-     success = True
- else:
-     st.write("Please upload a file first.")
-
- if query_engine_creation:
-
-     # Streamlit input for user query
-     if success:
-         user_query = st.text_input("Enter your query:")
-
-     # Query engine with user input
-     if user_query:
-         with st.spinner('Fetching the response...'):
-             response = query_engine.query(user_query)
-
-         st.markdown(f"**Response:** {response}")
+ import torch
+ from transformers import BitsAndBytesConfig
+ from llama_index.llms.huggingface import HuggingFaceLLM
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+ from llama_index.core import SimpleDirectoryReader
+ from llama_index.core import VectorStoreIndex, SummaryIndex
+ from llama_index.core.prompts import PromptTemplate
+ from llama_index.core import Settings
+
+
+ import gradio as gr
+
+
+ def messages_to_prompt(messages):
+     prompt = ""
+     for message in messages:
+         if message.role == "system":
+             m = "You are an expert in the research field of document understanding, bayesian deep learning and neural networks."
+             prompt += f"<|system|>\n{m}</s>\n"
+         elif message.role == "user":
+             prompt += f"<|user|>\n{message.content}</s>\n"
+         elif message.role == "assistant":
+             prompt += f"<|assistant|>\n{message.content}</s>\n"
+
+     # ensure we start with a system prompt, insert blank if needed
+     if not prompt.startswith("<|system|>\n"):
+         prompt = "<|system|>\n</s>\n" + prompt
+
+     # add final assistant prompt
+     prompt = prompt + "<|assistant|>\n"
+
+     return prompt
+
+
+ def load_RAG_pipeline():
+     # LLM
+     quantization_config = BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_compute_dtype=torch.float16,
+         bnb_4bit_quant_type="nf4",
+         bnb_4bit_use_double_quant=True,
+     )
+
+     llm = HuggingFaceLLM(
+         model_name="HuggingFaceH4/zephyr-7b-alpha",
+         tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
+         query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
+         context_window=3900,
+         max_new_tokens=256,
+         model_kwargs={"quantization_config": quantization_config},
+         # tokenizer_kwargs={},
+         generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
+         messages_to_prompt=messages_to_prompt,
+         device_map="auto",
+     )
+
+     # Llama-index
+     Settings.llm = llm
+     Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
+     # Settings.chunk_size = 512
+     # Settings.chunk_overlap = 50
+
+     # raw data
+     documents = SimpleDirectoryReader("assets/txts").load_data()
+     vector_index = VectorStoreIndex.from_documents(documents)
+     # summary_index = SummaryIndex.from_documents(documents)
+     query_engine = vector_index.as_query_engine(response_mode="compact", similarity_top_k=3)
+     return query_engine
+
+
+ query_engine = load_RAG_pipeline()
+
+
+ # These are placeholder functions to simulate the behavior of the RAG setup.
+ # You would need to implement these with the actual logic to retrieve and generate answers based on the document.
+ def get_answer(question, temperature, nucleus_sampling, max_tokens):
+     # Here you should implement the logic to generate an answer based on the question and the document.
+     # For example, you could use a machine learning model for RAG.
+     # answer = "This is a placeholder answer."
+     # https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/#setting-local-configurations
+     return query_engine.query(question)
+
+
+ def get_answer_page(question):
+     # Implement logic to retrieve the page number or an image of the page with the answer.
+     answer_page = "Page X - placeholder image."
+     return answer_page
+
+
+ # Create the gr.Interface function
+ def ask_my_thesis(question, temperature, nucleus_sampling, max_tokens):
+     answer = get_answer(question, temperature, nucleus_sampling, max_tokens)
+     answer_page = get_answer_page(question)
+     return answer, answer_page
+
+
+ # Set up the interface options based on the design in the image.
+ iface = gr.Interface(
+     fn=ask_my_thesis,
+     inputs=[
+         gr.Textbox(label="Question", placeholder="Type your question here..."),
+         gr.Slider(0, 1, value=0.7, label="Temperature"),
+         gr.Slider(0, 1, value=0.9, label="Nucleus Sampling"),
+         gr.Slider(1, 500, value=100, label="Max Generated Number of Tokens"),
+     ],
+     outputs=[gr.Textbox(label="Answer"), gr.Image(label="Answer Page")],
+     title="Ask my thesis",
+     description="Chat with the manuscript: ask questions and receive answers with references.",
+     allow_flagging="never",
+ )
+
+ # Start the application.
+ if __name__ == "__main__":
+     iface.launch()
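The zephyr-style prompt assembly in `messages_to_prompt` can be checked standalone. A minimal sketch, using a stand-in `Msg` class in place of llama-index's `ChatMessage` (same `role`/`content` attributes); unlike the app's version, which substitutes a fixed expert persona for system turns, this sketch keeps the system message's own content:

```python
from dataclasses import dataclass


@dataclass
class Msg:
    # Stand-in for llama-index's ChatMessage: only the attributes the template reads.
    role: str
    content: str


def zephyr_prompt(messages):
    # Wrap each turn in zephyr's <|system|>/<|user|>/<|assistant|> tags,
    # guarantee a leading (possibly empty) system block, and end with an
    # open assistant tag so the model continues from there.
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    return prompt + "<|assistant|>\n"
```

For a single user turn this yields `<|system|>\n</s>\n<|user|>\n...</s>\n<|assistant|>\n`, matching the `query_wrapper_prompt` template above.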
assets/txts/pg_0002.txt ADDED
@@ -0,0 +1 @@
+
assets/txts/pg_0003.txt ADDED
@@ -0,0 +1,25 @@
+ Intelligent Automation for AI-Driven Document
+ Understanding
+
+ Jordy VAN LANDEGHEM
+
+ Examination committee:
+ em. Prof. Dr. ir. Jean-Pierre Celis, chair
+ Prof. Dr. Marie-Francine Moens, supervisor
+ Prof. Dr. Matthew B. Blaschko, supervisor
+ Prof. Dr. ir. Johan Suykens
+ Prof. Dr. ir. Tinne Tuytelaars
+ Prof. Dr. Marcus Rohrbach
+ (TU Darmstadt)
+ Prof. Dr. Wenpeng Yin
+ (Penn State University)
+ Dr. Bertrand Anckaert
+ (Contract.fit)
+ March 2024
+
+ Dissertation presented in partial
+ fulfillment of the requirements for
+ the degree of Doctor of Engineering
+ Science (PhD): Computer Science
+
+
assets/txts/pg_0004.txt ADDED
@@ -0,0 +1,10 @@
+ © 2024 KU Leuven – Faculty of Engineering Science
+ Uitgegeven in eigen beheer, Jordy Van Landeghem, Celestijnenlaan 200A box 2402, B-3001 Leuven (Belgium)
+
+ Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden
+ door middel van druk, fotokopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande
+ schriftelijke toestemming van de uitgever.
+ All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm,
+ electronic or any other means without written permission from the publisher.
+
+
assets/txts/pg_0005.txt ADDED
@@ -0,0 +1,32 @@
+ Preface
+ This journey has been long and arduous, but I have finally reached an end. At
+ this end, I have a thesis that I am proud of, and I have learned a lot. As I look
+ back, I have been very fortunate to have had the support of many people, and I
+ would like to take this opportunity to thank them.
+ First and foremost, I would like to thank my supervisors, Sien and Matthew,
+ for their guidance and support throughout this journey. Sien has taught me
+ the importance of being thorough and meticulous, striving for diligence and
+ perfection from the get-go. I still remember how patiently she helped me with
+ my first paper, holding a Sunday afternoon call from her attic/home-office,
+ helping me hone the presentation and writing. Involving Matthew as the co-supervisor has been the best decision for my personal development, as he offered
+ a different perspective on my work, always challenging me to look at problems
+ from the lens of statistical theory and machine learning fundamentals. My
+ knee-jerk reaction to start implementing things as soon as possible was often
+ met with a “slow down, think about it first” from Matthew, which has been
+ invaluable in my development as a researcher. I am grateful to both of them
+ for their patience and understanding, and for giving me the freedom to explore
+ my own ideas and interests.
+ Next, a sincere thanks to my jury members, for taking the time to read my
+ thesis and for their valuable feedback. Furthermore, I would like to thank
+ het Vlaams Agentschap Innoveren & Ondernemen (VLAIO) for awarding the
+ Baekeland grant without which this PhD would not have been possible.
+ Pol & Bertrand, thanks for having me contribute to your dream to rid the
+ world of boring administrative processes and paperwork. Technically my bosses,
+ but in reality you are the embodiment of leadership by example, and I am
+ grateful for the many lessons I have learned from you. I am grateful for the
+ many opportunities you have given me to grow as a researcher and as a person.
+ Many thanks to my past and present colleagues at Contract.fit, for always
+
+ i
+
+
assets/txts/pg_0006.txt ADDED
@@ -0,0 +1,45 @@
+ preaching automation, inspiring me, and for having fun along the way. I am grateful to my LIIR colleagues at KU Leuven, particularly the folks from office 4.34, for the many interesting discussions and whiteboard sessions whenever I occasionally popped into the office.
+ I was fortunate to travel to many places during my PhD (Lausanne, Lisbon, Barcelona, San Jose, Paris, Waikoloa), and I have met many people along the way. My DUDEs, you have been the trigger to complete my PhD, reinvigorating my passion for research and inspiring me for my future career. How crazy is it that we conceived the seeds of the DUDE project in a pirates bar, on a hotel rooftop, and from a hospital bed after my back surgery?
+ Finally, I would like to thank my family and friends for their support and encouragement throughout this journey. My parents, Peter en Nadine, you have shown me that hard work pays off, and merci for the many sacrifices you have made to give me the best possible education and life. Marijke, you are the love of my life, and although I am not religious, you are my goddess, de mammiej. Feliz, when you came into our lives, you added an extra dimension. I used to see in 2D, now I see in 3D. Forever your father, your pappiej. Wes en Jen, thanks for showing me to never give up and keep on pushing: even when you are at your lowest, there is a way out, and only hard work will get you there. Cornbois (Bryan, Emile, (even) Jan), for our friendship, I fail to make an exhaustive definition. I wish for many more years of friendship from my like-minded brothers. John, Teunen, Wannes, if there is ever a zombie apocalypse, I know that I can count on you to have my window. Kessel-city (Poohke, Vinny, Kweinch etc.), thanks for keeping on pushing the bar higher, and inspiring me with your ambition and drive. Gustaf, thanks for the many laughs (#velleke) and the much-needed distraction. Elstipoes, you are my oldest friend, and I am grateful for the many years of friendship. Woutje, thanks for your contagious optimism and the mancave during university. Leuvenbende, you were the ones that made university fun and enjoyable. Individually and together you are beautiful people, and I cherish our yearly reunions. Lauren en Yannick, thanks for letting me win at Mario Kart. I might be forgetting some people, but I would like to thank all my friends for bringing joy, for keeping me grounded, and for reminding me that there is more to life than work.
+ Having studied literature in my Bachelor’s, it feels appropriate to finish with a quote wrongly attributed to Ernest Hemingway: “Write drunk; edit sober.”
+
+ Jordy Van Landeghem
+ Gurdo, Pogomeister, Jorre, De Van Laaandeghem
+ February, 2024
+ Kessel, Belgium
assets/txts/pg_0007.txt ADDED
@@ -0,0 +1,33 @@
+ Abstract
+
+ Human communication is increasingly document-based, requiring machines to understand a wide variety of visually-rich documents to assist humans in their daily lives. Amid the digital evolution, documents continue to facilitate crucial human and organizational interactions but remain tethered to manual processing, causing inefficiency. We examine why organizations lag in adopting automated document processing solutions and outline two primary challenges: the complexity of processing long, multimodal documents algorithmically and the necessity for reliability and control over associated risks. Automated decision-making is key to improving the efficiency of document processing, but the current state-of-the-art technology is not yet reliable and robust enough to be deployed in autonomous systems.
+ The practical objective is to develop Intelligent Automation (IA) systems capable of estimating confidence in their actions, thereby increasing throughput without accruing additional costs due to errors. We analyze the key challenges and propose solutions to bridge the gap between research and practical applications, with a focus on realistic datasets and experimental methodologies. Building upon foundations of Document Understanding (DU), this dissertation introduces advanced methodologies combining Machine Learning, Natural Language Processing, and Computer Vision.
+ Addressing the evident gaps in research, this work presents novel methods for predictive uncertainty quantification (PUQ) alongside practical frameworks for evaluating the robustness and reliability of DU technologies. The contribution culminates in the introduction of two novel multipage document classification datasets and a multifaceted benchmark, DUDE, designed to rigorously challenge and assess the state-of-the-art in DU. Extensive experiments across these datasets reveal that while advancements have been made, significant room for improvement remains, particularly in long-context modeling for multipage document processing and calibrated, selective document visual question answering. Efficient DU is also explored, revealing the effectiveness of
+
+ iii
assets/txts/pg_0008.txt ADDED
@@ -0,0 +1,35 @@
+ knowledge distillation (KD) model compression in visually-rich document layout analysis (DLA) and classification.
+ Through empirical studies and methodological contributions, this dissertation makes the following contributions and findings:
+ First, in a benchmarking study of established PUQ methods on real-world text classification, we find that our novel hybrid method ‘Concrete Dropout Ensemble’ performs best, enhancing in-domain calibration and novel class detection, even at a smaller ensemble size. Detailed ablation experiments reveal the impact of prior, neural architecture, and hyperparameter choices on PUQ estimation quality.
+ Second, on a prototypical DU task, we identify challenges in DU progress, propose a formalization of multipage document classification scenarios, construct novel datasets, and conduct an experimental analysis showing the promise of multipage representation learning and inference.
+ Third, we introduce DUDE, incorporating multifaceted challenges and principles for a comprehensive evaluation of generic DU. Next to our own benchmarking, we organize a competition, revealing that while newer document foundation models show promise, they struggle with questions involving visual evidence or complex reasoning. Moreover, we find severe problems in the ability of Large Language Models (LLMs) to reason about documents in their entirety, highlighting issues with hallucination, long-context reasoning, and control.
+ Fourth, we propose the first methodology for enriching documents with semantic layout structure using distilled DLA models. We apply KD to visual document tasks, unraveling the influence of various task and architecture components.
+ Finally, the dissertation concludes with a discussion of the findings and implications for future research, emphasizing the need for advancements in multipage document representation learning and the importance of realistic datasets and experimental methodologies to measurably move toward reliable and robust IA-DU technology.
assets/txts/pg_0009.txt ADDED
@@ -0,0 +1,34 @@
+ Beknopte samenvatting
+
+ Menselijke communicatie is in toenemende mate documentgebaseerd, waarbij machines een breed aanbod aan visueel-rijke documenten moeten begrijpen om mensen in hun dagelijks leven te assisteren. Te midden van de digitale evolutie blijven documenten cruciale menselijke en organisatorische interacties faciliteren, maar zijn ze gebonden aan handmatige verwerking, wat inefficiëntie veroorzaakt. We onderzoeken waarom organisaties achterblijven bij het adopteren van geautomatiseerde documentverwerkingsoplossingen en schetsen twee primaire uitdagingen: de complexiteit van het algoritmisch verwerken van lange, multimodale documenten en de noodzaak van betrouwbaarheid en controle over daarmee samenhangende risico’s. Geautomatiseerde besluitvorming is essentieel voor het verbeteren van de efficiëntie van documentverwerking, maar de huidige stand van de technologie is nog niet betrouwbaar en robuust genoeg om ingezet te worden in autonome toepassingen.
+ Het praktische doel dat gesteld wordt, is het ontwikkelen van systemen voor Intelligente Automatisering (IA) die in staat zijn om vertrouwen in hun acties te schatten, daarmee de doorvoer verhogend zonder extra kosten vanwege fouten. We analyseren de belangrijkste uitdagingen en stellen oplossingen voor om de kloof tussen onderzoek en praktische toepassingen te overbruggen, met een focus op realistische datasets en experimentele methodologieën. Voortbouwend op de fundamenten van Documentinterpretatie (DI), introduceert dit proefschrift geavanceerde methodologieën die Machinaal Leren, Natuurlijke Taalverwerking en Computer Visie combineren.
+ Door de duidelijke hiaten in onderzoek aan te pakken, presenteert dit werk nieuwe methoden voor predictieve onzekerheidskwantificering (POK) naast praktische kaders voor het evalueren van de robuustheid en betrouwbaarheid van DI-technologieën. De bijdrage culmineert in de introductie van twee nieuwe datasets voor classificatie van multipagina documenten en een veelzijdige benchmark, DUDE, ontworpen om de state-of-the-art in DI rigoureus uit te dagen en te beoordelen. Uitgebreide experimenten met deze datasets
+
+ v
assets/txts/pg_0010.txt ADDED
@@ -0,0 +1,44 @@
+ onthullen dat er weliswaar vooruitgang is geboekt, maar dat er nog significant veel ruimte is voor verbetering, met name in de lange-contextmodellering voor de verwerking van multipagina documenten en gekalibreerde, selectieve visuele vraagbeantwoording van documenten. Meer schaalbaar DI wordt ook verkend, waarbij de effectiviteit van kennisdistillatie (KD) voor modelcompressie in visueel-rijke layoutanalyse (DLA) en classificatie van documenten aan het licht komt.
+ Door middel van empirische studies en methodologische bijdragen heeft dit proefschrift de volgende bijdragen en bevindingen:
+ Ten eerste vinden we in een benchmarkstudie van gevestigde POK-methoden op tekstclassificatie in de echte wereld dat onze nieuwe hybride POK-methode ’Concrete Dropout Ensemble’ het beste presteert, waarbij ze de kalibratie binnen het domein en de detectie van nieuwe klassen verbetert, zelfs met een kleiner ensemble. Gedetailleerde ablatie-experimenten onthullen de impact van voorafgaande kennis, neurale architectuur en keuzes van hyperparameters op de kwaliteit van POK-schatting.
+ Ten tweede identificeren we uitdagingen in de vooruitgang van DI, stellen een formalisatie voor van multipagina documentclassificatiescenario’s, bouwen nieuwe datasets, en voeren een experimentele analyse uit die de belofte van multipagina representatie-leren en inferentie toont.
+ Ten derde introduceren we DUDE, waarin veelzijdige uitdagingen en principes worden voorgesteld voor een uitgebreide evaluatie. Naast onze eigen benchmarking organiseren we een competitie, waaruit blijkt dat hoewel nieuwere modellen veelbelovend zijn, ze het moeilijk hebben met vragen die visueel bewijs of complex redeneren vereisen. Bovendien vinden we ernstige problemen in het vermogen van Grote Taalmodellen (LLMs) om over documenten in hun geheel te redeneren, wat problemen benadrukt met hallucinatie, redeneren met lange context en controle.
+ Ten vierde stellen we de eerste experimentele methodologie voor om documenten te verrijken met semantische layoutstructuur met behulp van gedestilleerde DLA-modellen. We passen KD toe op visuele documenttaken, waarbij we de invloed van verschillende taak- en architectuurcomponenten ontrafelen.
+ Ten slotte sluit het proefschrift af met een bespreking van de bevindingen en implicaties voor toekomstig onderzoek, waarbij de noodzaak wordt benadrukt van vooruitgang in multipagina documentrepresentatie-leren en het belang van realistische datasets en experimentele methodologieën om meetbaar vooruitgang te boeken naar betrouwbare en robuuste IA-DI technologie.
assets/txts/pg_0013.txt ADDED
@@ -0,0 +1,22 @@
+ List of Abbreviations
+
+ AAPD Arxiv Academic Paper Dataset
+ Acc_ID Accuracy in-domain
+ Acc_OOD Accuracy out of domain
+ AI Artificial Intelligence
+ ANLS Average Normalized Levenshtein Similarity
+ AUPR Area Under the Precision-Recall Curve
+ AURC Area-Under-Risk-Coverage-Curve
+ AUROC Area Under the Receiver Operating Characteristic curve
+ BDL Bayesian Deep Learning
+ BNN Bayesian Neural Network
+ BPM Business Process Management
+ CE Cross-Entropy
+ CER Character Error Rate
+ COCO Common Objects in Context
+ CSF Confidence Scoring Function
+ CV Computer Vision
+ DC Document Classification
+ DG Document Generation
+
+ ix
assets/txts/pg_0014.txt ADDED
@@ -0,0 +1,30 @@
+ DL Deep Learning
+ DLA Document Layout Analysis
+ DNN Deep Neural Network
+ DocAI Document AI
+ DocVQA Document Visual Question Answering
+ DOD Document Object Detection
+ DU Document Understanding
+ DUDE Document UnderstanDing of Everything
+ ECE Expected Calibration Error
+ ELBO Evidence Lower Bound
+ ERM Empirical Risk Minimization
+ FasterRCNN Faster Region-based Convolutional Neural Network
+ FP False Positives
+ IA Intelligent Automation
+ ICDAR International Conference on Document Analysis and Recognition
+ IDP Intelligent Document Processing
+ i.i.d. Independent and Identically Distributed
+ IOB/IOBES Inside, Outside, Beginning / End, Single
+ KD Knowledge Distillation
+ KIE Key Information Extraction
+ LLM Large Language Model
+ MAP Maximum-a-Posteriori
+ mAP Mean Average Precision
+ MCD Monte Carlo Dropout
assets/txts/pg_0015.txt ADDED
@@ -0,0 +1,30 @@
+ MCMC Markov Chain Monte-Carlo
+ MDLT Multi-Domain Long-Tailed Recognition
+ MECE Mutually Exclusive and Collectively Exhaustive
+ MI Mutual Information
+ ML Machine Learning
+ MSE Mean Squared Error
+ MSP Maximum Softmax Probability
+ MU Model Uncertainty
+ NLG Natural Language Generation
+ NLL Negative Log Likelihood
+ NLP Natural Language Processing
+ NN Neural Network
+ OCR Optical Character Recognition
+ OOD Out-of-Distribution
+ PCC Pearson Correlation Coefficient
+ PUQ Predictive Uncertainty Quantification
+ RERM Regularized Empirical Risk Minimization
+ ResNet Residual Network
+ RPA Robotic Process Automation
+ SaaS Software-as-a-service
+ SNGP Spectral-normalized Neural Gaussian Process
+ SOTA State-of-the-art
+ STP Straight-Through-Processing
+ TSR Table Structure Recognition
+
+ xi
assets/txts/pg_0016.txt ADDED
@@ -0,0 +1,12 @@
+ VDU Visual Document Understanding
+ VI Variational Inference
+ VLM Vision Language Model
+ VQA Visual Question Answering
+ VRD Visually-Rich Document
+ WER Word Error Rate
assets/txts/pg_0017.txt ADDED
@@ -0,0 +1,200 @@
+ Contents
+
+ Abstract  iii
+ Beknopte samenvatting  v
+ List of Abbreviations  xii
+ Contents  xiii
+ List of Figures  xix
+ List of Tables  xxv
+
+ 1 Introduction  1
+ 1.1 Research Context  4
+ 1.2 Problem Statement and Questions  6
+ 1.2.1 Reliable and Robust Deep Learning  6
+ 1.2.2 Realistic and Efficient Document Understanding  7
+ 1.3 Outline  9
+
+ 2 Fundamentals  11
+ 2.1 Statistical Learning  12
+ 2.1.1 Neural Networks  14
+ 2.1.2 Probabilistic Evaluation  15
+ 2.1.3 Architectures  16
+ 2.1.3.1 Convolutional Neural Networks  17
+ 2.1.3.2 Language Neural Networks  18
+ 2.1.3.3 Transformer Network  19
+ 2.2 Reliability and Robustness  21
+ 2.2.1 Generalization and Adaptation  22
+ 2.2.2 Confidence Estimation  23
+ 2.2.3 Evaluation Metrics  24
+
+ xiii
assets/txts/pg_0018.txt ADDED
@@ -0,0 +1,204 @@
+ 2.2.4 Calibration  28
+ 2.2.5 Predictive Uncertainty Quantification  30
+ 2.2.6 Failure Prediction  32
+ 2.3 Document Understanding  33
+ 2.3.1 Task Definitions  35
+ 2.3.2 Datasets  36
+ 2.3.3 Models  37
+ 2.3.4 Challenges in Document Understanding  38
+ 2.3.4.1 Long-Context Modeling  39
+ 2.3.4.2 Document Structure Modeling  40
+ 2.4 Intelligent Automation  41
+
+ I Reliable and Robust Deep Learning  43
+
+ 3 Benchmarking Scalable Predictive Uncertainty in Text Classification  44
+ 3.1 Introduction  46
+ 3.2 Related Work  48
+ 3.3 Uncertainty Methods  51
+ 3.3.1 Quantifying Uncertainty in Deep Learning  51
+ 3.3.2 Predictive Uncertainty Methods  52
+ 3.3.2.1 Monte Carlo Dropout  53
+ 3.3.2.2 Deep Ensemble  53
+ 3.3.2.3 Concrete Dropout  54
+ 3.3.2.4 Heteroscedastic Extensions  54
+ 3.3.3 Uncertainty Estimation  55
+ 3.3.4 Motivating Hybrid Approaches  58
+ 3.3.5 Uncertainty Calibration under Distribution Shift  59
+ 3.4 Experimental Methodology  61
+ 3.4.1 Proposed Hybrid Approaches  61
+ 3.4.2 Datasets  63
+ 3.4.3 Architecture  64
+ 3.4.4 Evaluation metrics  66
+ 3.4.5 Experimental design  66
+ 3.4.5.1 In-domain Setting  67
+ 3.4.5.2 Cross-domain Setting  67
+ 3.4.5.3 Novelty Detection Setting  68
+ 3.5 Results  69
+ 3.5.1 Experiment: In-domain  70
+ 3.5.2 Experiment: Cross-domain  71
+ 3.5.3 Experiment: Novelty Detection  73
+ 3.5.4 Experiment: Ablations  75
+ 3.5.4.1 Diversity  76
assets/txts/pg_0019.txt ADDED
@@ -0,0 +1,394 @@
+ 3.5.4.2 NLP Architecture  77
+ 3.5.4.3 Ensemble size M  79
+ 3.5.4.4 Concrete Dropout p  80
+ 3.6 Discussion  81
+ 3.7 Additional Uncertainty Approaches  85
+ 3.7.1 Stochastic Gradient MCMC Methods  86
+ 3.7.2 Spectral-normalized Neural Gaussian Process  87
+ 3.7.2.1 SNGP Results  88
+ 3.7.2.2 SNGP Discussion  90
+ 3.8 Limitations  90
+ 3.9 Chapter Conclusion  91
+
+ II Realistic and Efficient Document Understanding  94
+
+ 4 Beyond Document Page Classification: Design, Datasets, and Challenges  95
+ 4.1 Introduction  97
+ 4.2 Problem Formulation  98
+ 4.3 Balancing Research & Applications  101
+ 4.4 Experimental Study  104
+ 4.5 Challenges and Guidelines  107
+ 4.5.1 Divergence of Tasks: f  107
+ 4.5.2 Divergence of Label Space: Y  108
+ 4.5.3 Divergence of Input Data: X  109
+ 4.5.4 Maturity of Evaluation Methodology  111
+ 4.6 Chapter Conclusion  111
+
+ 5 Document UnderstanDing of Everything (DUDE)  113
+ 5.1 Introduction  116
+ 5.2 Related Work  117
+ 5.3 DUDE Dataset  118
+ 5.3.1 Gathering Documents  121
+ 5.3.2 Annotation Process  121
+ 5.3.3 Dataset Statistics  123
+ 5.3.4 Diagnostic Subsets  125
+ 5.3.5 Evaluation  126
+ 5.4 DUDE Competition  128
+ 5.4.1 Challenge Objectives  128
+ 5.4.2 Challenge Contributions  129
+ 5.4.3 Motivation and Scope  129
+ 5.4.3.1 Desired Generalization  130
+
+ xv
assets/txts/pg_0020.txt ADDED
@@ -0,0 +1,320 @@
+ xvi
+
+ CONTENTS
+
+ 5.4.4 DUDE Competition Protocol . . . . . . . . . . . . . 131
+ 5.4.4.1 Task Formulation . . . . . . . . . . . . . 132
+ 5.4.4.2 Evaluation Protocol . . . . . . . . . . . . 132
+ 5.5 DUDE Benchmark . . . . . . . . . . . . . . . . . . . . . 133
+ 5.5.1 Baselines . . . . . . . . . . . . . . . . . . . . . . 133
+ 5.5.2 Analysis & Discussion . . . . . . . . . . . . . . . 134
+ 5.6 Detailed Results Analysis . . . . . . . . . . . . . . . . . 136
+ 5.6.1 Within Model Class Analysis . . . . . . . . . . . 136
+ 5.6.1.1 Encoder vs. Decoder . . . . . . . . . . . 136
+ 5.6.1.2 Incorporating Layout & Vision . . . . . . 136
+ 5.6.1.3 Toward Long Document Processing . . . 136
+ 5.6.1.4 Diagnosis of LLM Results . . . . . . . . 137
+ 5.6.2 Assessing Confidence . . . . . . . . . . . . . . . . 138
+ 5.7 DUDE Competition Results . . . . . . . . . . . . . . . . 138
+ 5.7.1 Submitted Methods . . . . . . . . . . . . . . . . 138
+ 5.7.2 Performance Analysis . . . . . . . . . . . . . . . 139
+ 5.8 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . 144
+
+ 6 DistilDoc: Knowledge Distillation for Visually-Rich Document
+   Applications                                                145
+ 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 147
+ 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . 149
+ 6.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . 151
+ 6.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . 152
+ 6.3.2 Architectures and Backbones . . . . . . . . . . . 153
+ 6.3.3 KD Methods . . . . . . . . . . . . . . . . . . . . 155
+ 6.3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . 157
+ 6.3.5 DLA-enriched LLM prompting . . . . . . . . . . 158
+ 6.4 Results & Discussion . . . . . . . . . . . . . . . . . . . 158
+ 6.5 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . 163
+
+ 7 Conclusion                                                  165
+ 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 165
+ 7.2 Perspectives For Future Research . . . . . . . . . . . . 171
+ 7.2.1 Open Problems In Reliability & Robustness . . . 172
+ 7.2.2 A Future-Proof Design Of IA-DU . . . . . . . . 173
+ 7.2.2.1 The ‘Ultimate’ DU Dataset? . . . . . . . 173
+ 7.2.2.2 A Feature-complete IA-DU Solution? . . 178
+
+ Bibliography                                                  181
+
+ A Appendix - PUQ                                              223
+ A Implementation Details . . . . . . . . . . . . . . . . . . . 223
+
assets/txts/pg_0021.txt ADDED
@@ -0,0 +1,421 @@
+ CONTENTS
+
+ xvii
+
+ A.1 Software and Data . . . . . . . . . . . . . . . . . . . . 223
+ A.2 Hyperparameter Defaults . . . . . . . . . . . . . . . . . 223
+ B Practical Considerations . . . . . . . . . . . . . . . . . . . 224
+ B.1 Take-home Summary . . . . . . . . . . . . . . . . . . . 224
+ B.2 Compute vs. Performance Trade-off . . . . . . . . . . . 225
+ C Detailed Experiment Results . . . . . . . . . . . . . . . . . 226
+ C.1 Zoom-in Benchmark Evidence . . . . . . . . . . . . . . 226
+ C.2 Absolute Benchmark Results . . . . . . . . . . . . . . . 226
+
+ B Appendix - BDPC                                             230
+ A Existing DC Datasets . . . . . . . . . . . . . . . . . . . . 230
+ B Visualization of Proposed DC Datasets . . . . . . . . . . . 231
+
+ C Appendix - DUDE                                             232
+ A Baseline Experiments Setup . . . . . . . . . . . . . . . . . 232
+ A.1 Hyperparameter Defaults . . . . . . . . . . . . . . . . . 232
+ A.2 Generative LLM Prompt Fine-tuning . . . . . . . . . . 232
+ A.3 Confidence Estimation . . . . . . . . . . . . . . . . . . 233
+ A.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 235
+ B Qualitative Examples . . . . . . . . . . . . . . . . . . . . . 235
+ B.1 Qualitative Examples - Competition . . . . . . . . . . . 241
+
+ D Appendix - KDD                                              244
+ A Code and Datasets . . . . . . . . . . . . . . . . . . . . . . 244
+ B Implementation Details . . . . . . . . . . . . . . . . . . . 244
+ C Task Definitions . . . . . . . . . . . . . . . . . . . . . . . 246
+ D Additional Experiment Results . . . . . . . . . . . . . . . 247
+ D.1 Tobacco-3482 Results . . . . . . . . . . . . . . . . . . . 249
+ D.2 PRImA Results . . . . . . . . . . . . . . . . . . . . . . 249
+ D.3 RVL-CDIP-N Results . . . . . . . . . . . . . . . . . . . 249
+ D.4 Downstream DocVQA Results . . . . . . . . . . . . . . 249
+ D.5 Ablation Experiments . . . . . . . . . . . . . . . . . . . 249
+
+ Curriculum                                                    253
+
+ Publications                                                  255
+
assets/txts/pg_0033.txt ADDED
@@ -0,0 +1,30 @@
1
+ Chapter 1
2
+
3
+ Introduction
4
+ “Amid significant life events—like buying a house or expecting your
+ firstborn child—lies a less cheerful reality that I experienced
7
+ firsthand: the hassle of dealing with manual paperwork.
8
+
9
+ For the former case, this required a lot of back-and-forth with
10
+ the bank, the notary, and the real estate agent, with each of
11
+ them requiring a different set of documents (e.g., monthly pay
12
+ stubs, bank statements, copies of national registry, etc.) to be
13
+ filled in, signed, and sent back for processing.
14
+ On the side of the document processors, each document needed
15
+ to be classified, key information extracted, and the information
16
+ validated against other documents to be able to prove my
17
+ solvency in making an offer, applying for a loan, or being drafted
18
+ as the future house owner. In between all parties and external
19
+ organizations, even more documents were either created, adapted,
20
+ or passed along such as the offer, the loan agreement, the deed
21
+ of sale, a soil certificate, etc.
22
+ This juxtaposition of valuable moments in life with cumbersome
23
+ administrative procedures involving manual document
24
+ processing forms the backdrop against which I aim to explore
25
+ and propose potential solutions in this thesis.
26
+
27
+
28
+ 1
29
+
30
+
assets/txts/pg_0034.txt ADDED
@@ -0,0 +1,44 @@
1
+ 2
2
+
3
+ INTRODUCTION
4
+
5
+ Documents are containers of information that are easily shareable. The concept
6
+ of a document dates back to when humans started writing and has been a
7
+ cornerstone of human communication ever since. In the age of digital technology,
8
+ documents are still the primary means of communication between humans and
9
+ organizations and form the backbone of many business processes. Human
10
+ communication is increasingly happening through digital channels, and the
11
+ COVID-19 pandemic has only accelerated this trend. We are increasingly living
12
+ in a “document society” [53], dependent on documents in our daily lives or for
13
+ recording second-hand knowledge. With instant gratification as the norm in
14
+ the digital age, people expect similar seamless interactions with businesses and
15
+ governments. While digitization has increased the speed and ease of document-based communication, document processing remains a largely human effort with
16
+ organizations drowning under the sheer volume of documents they receive.
17
+ So why have organizations not switched en masse to
18
+ automated document processing?
19
+ The answer lies for some part in (I) the complexity of the task, and for the
20
+ other part in (II) the need for reliability and risk control.
21
+ (I) While it might be straightforward for a human (white-collar) worker to read
22
+ a long, structured document, understand its contents, categorize it, and extract
23
+ crucial information accordingly, this is not so easy for a machine. This could be
24
+ perceived as an instance of Moravec’s paradox [319], which states that tasks
25
+ that are easy for humans are hard for machines, and vice versa. However, in
26
+ recent times, significant strides forward have been made thanks to technological
27
+ advances combining Natural Language Processing (NLP), Computer Vision
28
+ (CV) and Machine Learning (ML). Document Understanding (DU) is
29
+ the umbrella term for both the end-to-end solution and the research field
30
+ studying how to make machines interpret and understand documents (elaborated
31
+ on in Section 2.3). It has seen a surge in interest in the past few years, with
32
+ the rise of large-scale pretrained Language and Vision models (LLM, VLM)
33
+ [52, 94, 101, 187, 380, 383, 502] capable of modeling document inputs.
34
+ What makes DU challenging is that it encompasses multiple subtasks, each of
35
+ which is a research field in its own right, such as Optical Character Recognition
36
+ (OCR), Document Layout Analysis (DLA), Document Classification (DC), Key
37
+ Information Extraction (KIE), Visual Question Answering (VQA), etc. The
38
+ complexity of the task is further increased by the fact that documents are
39
+ multimodal, containing both text and images and that they are compositional,
40
+ i.e., the meaning of the document is not just the sum of its parts. Information
41
+ can appear in a wide range of forms including text, images, tables or graphs,
42
+ and be spread across multiple pages. Moreover, the meaning of a document
43
+
44
+
assets/txts/pg_0035.txt ADDED
@@ -0,0 +1,46 @@
3
+ 3
4
+
5
+ can change depending on the context in which it is used. As an artifact of the
6
+ communication channel, not all documents are born digitally, and the quality
7
+ of the document can vary greatly, with some documents being handwritten,
8
+ scanned with low resolution, or even a picture of a document. Furthermore,
9
+ documents are often not standardized templates and can be highly variable in
10
+ terms of layout, structure, and content. Finally, the longer the document, the
11
+ more computationally demanding it becomes to process, and the more likely it
12
+ is to induce errors, which can be harder to detect.
13
+ Addressing the inherent challenges of document processing, and achieving high
14
+ levels of accuracy, processing speed, reliability, robustness, and scalability in
15
+ DU forms the applied scope of this thesis.
16
+ (II) Consider the example given of the birth certificate. While I might not
17
+ appreciate as much the manual handling of this document, if they had registered
18
+ my baby girl’s name (Feliz, the Spanish spelling, without an accent on the ‘e’)
19
+ incorrectly, I would be pretty upset as this could have further repercussions.
20
+ Whereas this error might be easily rectified, it is not so easy to do so in the
21
+ case of a mortgage application, where the wrong information could lead to a
22
+ rejection of the application, or even worse, a loan agreement with the wrong
23
+ terms and conditions. This demonstrates that, even when full automation of
24
+ document processing is in high demand, it is not always desirable if the risk of
25
+ failure might be too large.
26
+ Nevertheless, a lot of the potential for automation remains untapped, and
27
+ organizations are increasingly looking for solutions to fully automate their
28
+ document processing workflows. However, full automation, implying perfect
29
+ recognition of document categories and impeccable information extraction is an
30
+ unattainable goal with the current state of technology [79].
31
+ The more realistic objective set is Intelligent Automation (IA) (elaborated
32
+ on in Section 2.4), where the goal is to have the machine estimate confidence
33
+ in its predictions, deriving business value from the highest possible volume of
34
+ perfect predictions (Straight-Through-Processing, STP) without incurring extra
35
+ costs (False Positives, FP).
36
+ The leitmotif of this thesis will be the fundamental enablers of IA: confidence
37
+ estimation and failure prediction.
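The IA decision rule described above can be illustrated with a toy sketch (the function, labels, threshold, and numbers are all hypothetical, not the thesis implementation): predictions whose confidence clears a threshold go Straight-Through-Processing, the rest are routed to a human reviewer, which is how one trades STP volume against False Positive risk.

```python
# Hypothetical sketch of IA-style confidence routing: keep the human in the
# loop only for predictions with a higher likelihood of being wrong.

def route(predictions, threshold=0.9):
    """Split (label, confidence) pairs into auto-processed (STP) and review queues."""
    stp, review = [], []
    for label, conf in predictions:
        # High confidence -> straight-through processing; otherwise human review.
        (stp if conf >= threshold else review).append(label)
    return stp, review

preds = [("invoice", 0.98), ("contract", 0.65), ("pay_stub", 0.93)]
auto, manual = route(preds)
print(auto)    # high-confidence predictions processed automatically
print(manual)  # low-confidence predictions kept for a human in the loop
```

Raising the threshold lowers the False Positive rate at the cost of STP volume; calibrated confidences are what make this trade-off controllable.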
38
+ Calibrated uncertainty estimation with efficient and effective DU technology
39
+ will allow organizations to confidently automate their document processing
40
+ workflow, while keeping a human in the loop only for predictions with a higher
41
+ likelihood of being wrong. To date, however, little research has addressed the
42
+ question of how to make DU technology more reliable, as is illustrated in a toy
43
+ analysis (Table 1.1) reporting the absence of many IA-related keywords in the
44
+ Proceedings of the 2021 International Conference on Document Analysis and
45
+
46
+
assets/txts/pg_0036.txt ADDED
@@ -0,0 +1,80 @@
1
+ 4
2
+
4
+
5
+ Recognition (ICDAR) [289].
6
+ The thesis aims to fill this gap by proposing novel methods for uncertainty
7
+ estimation and failure prediction (Part I), and by providing a framework for
8
+ benchmarking and evaluating the reliability and robustness of DU technology,
9
+ as close as possible to real-world requirements (Part II).
10
+ Table 1.1. Comparative analysis of keywords in the ICDAR 2021 proceedings. While
11
+ many DU subtasks are represented, there is a lack of keywords related to IA. Do note
12
+ that calibration is used in the context of camera calibration, and not in the context of
13
+ confidence estimation.
14
+ keyword              freq  |  keyword                       freq
+ document             3388  |  calibration/calibrate           33
+ classification        242  |  temperature scaling              0
+ key information        56  |  failure prediction               0
+ question answering    106  |  misclassification detection      0
+ layout analysis       223  |  out-of-distribution / OOD       25
+                            |  predictive uncertainty           0
54
+
55
+ In the remainder of the Introduction, I will sketch the surrounding research
56
+ context, followed by the problem statement and research questions, and finally
57
+ the outline of the thesis manuscript.
58
+
59
+ 1.1
60
+
61
+ Research Context
62
+
63
+ All chapters of this dissertation have been executed as part of the Baekeland
64
+ PhD mandate (HBC.2019.2604) with financial support of VLAIO (Flemish
65
+ Innovation & Entrepreneurship) and Contract.fit. The latter is a Belgian-based
66
+ software-as-a-service (SaaS) provider of Intelligent Document Processing (IDP)
67
+ drawing on innovations in DU to power their product suite (email-routing,
68
+ Parble), and my generous employer since 2017.
69
+ Some of the joint work (Chapter 5) has been partially funded by a PhD
70
+ Scholarship from AGAUR (2023 FI-3-00223), and the Smart Growth Operational
71
+ Programme under projects no. POIR.01.01.01-00-1624/20 (Hiper-OCR - an
72
+ innovative solution for information extraction from scanned documents) and
73
+ POIR.01.01.01-00-0605/19 (Disruptive adoption of Neural Language Modelling
74
+ for automation of text-intensive work).
75
+ Moreover, given that the dissertation work has been performed over a large
76
+ span of time, it warrants putting it in the larger context and dynamics of AI
77
+ innovations, the state of DU as a field, how notions of ‘reliability’ have evolved
78
+ over time, and finally the business context.
79
+
80
+
assets/txts/pg_0037.txt ADDED
@@ -0,0 +1,42 @@
1
+ RESEARCH CONTEXT
2
+
3
+ 5
4
+
5
+ This thesis started almost concurrently with the rise of the global COVID-19 pandemic, making it hard to foster collaborations in the early stages. At
6
+ the start of the PhD, DU methodology was fairly established, with OCR and
7
+ Transformer-based pipelines such as BERT [94] and LayoutLM [502], which
8
+ is why we first prioritized the more fundamental challenge of decision-making
9
+ under uncertainty (Part I); which was followed by a step back, closer to applied
10
+ DU research (Part II).
11
+ The research community’s understanding of ‘reliability’ has also evolved over
12
+ time. When starting the work of Chapter 3, the notion of reliability was mostly
13
+ associated with uncertainty quantification and calibration. However, calibration
14
+ is not a panacea, and only fairly recently, Jaeger et al. [193] proposed a more
15
+ general framework encapsulating reliability and robustness. They promote the
16
+ more concrete and useful notion of failure prediction, which still involves
17
+ confidence/uncertainty estimation yet with an explicit definition of the failure
18
+ source which one wants to detect or guard against, e.g., in-domain test errors,
19
+ changing input feature distributions, novel class shifts, etc. Since I share a
20
+ similar view of the problem, I have focused following works on the more general
21
+ notion of failure prediction, which is also more in line with the business context
22
+ of IA.
23
+ Whereas we originally intended to work on multi-task learning of DU subtasks,
24
+ the rise of general-purpose LLMs offering a natural language interface to
25
+ documents rather than discriminative modeling (e.g., ChatGPT [52, 344]),
26
+ prompted us toward evaluating this promising technology in the context of
27
+ DU. More importantly, we observed the lack of sufficiently complex datasets
28
+ and benchmarks in DU that would allow us to tackle larger, more fundamental
29
+ questions such as ‘Do text-only LLMs suffice for most low-level DU subtasks?’
30
+ (subsequently tackled in Chapter 5), which is why we shifted our focus to the
31
+ more applied research questions of benchmarking and evaluation (Part II).
32
+ Finally, the business context has also evolved over time. Originally, IDP was
33
+ practiced by legacy OCR companies; specialized vendors, offering a range of
34
+ solutions for specific document types (e.g., invoices, contracts, tax forms, etc.);
35
+ or cloud service providers, offering IDP as part of a larger suite of services
36
+ (e.g., AWS Textract, Azure Form Recognizer, etc.). However, the rise of both
37
+ open-source LLM development and powerful, though closed-source models has
38
+ lowered the barrier to entry for any new entrants or incumbents. This has led
39
+ to a commoditization of IDP, with the quality of the LLMs and the ease of
40
+ integration with existing business processes becoming key differentiators.
41
+
42
+
assets/txts/pg_0038.txt ADDED
@@ -0,0 +1,45 @@
+ 6
+
+ 1.2 Problem Statement and Questions
+
9
+ The general introduction sketches the context of the research, and motivates
10
+ the research questions. In this Section, I will formulate the problem statement
11
+ and research questions more formally and explain how they relate to the manuscript’s
12
+ contents.
13
+
14
+ 1.2.1
15
+
16
+ Reliable and Robust Deep Learning
17
+
18
+ The dissertation opens with the more fundamental challenge of targeting
19
+ reliability and robustness in Deep Learning, which covers fairly abstract concepts
20
+ that have been used interchangeably and inconsistently in the literature. They
21
+ will be defined more extensively in Section 2.2, but for now, consider reliability
22
+ as the ability to avoid failure, robustness as the ability to resist failure, and
23
+ resilience as the ability to recover from failure [373, 438, 455]. In Chapter 3, we
24
+ focus on the more concrete objective of predictive uncertainty quantification
25
+ (PUQ), which shows promise for improving reliability and robustness in Deep
26
+ Learning (DL) [123, 140, 173, 455]. Concretely, PUQ methods are expected to
27
+ elucidate sources of uncertainty such as a model’s lack of in-domain knowledge
28
+ due to either training data scarcity or model misspecification, or its ability to
29
+ flag potentially noisy, shifted or unknown input data [136].
30
+ We observed that the majority of prior PUQ research focused on regression and
31
+ CV tasks, while the applicability of PUQ methods had not been thoroughly
32
+ explored in the context of NLP. As mentioned earlier, most DU pipelines (in
33
+ 2020) were text-centric with a high dependency on the quality of OCR. Since
34
+ OCR is often considered a solved problem [262], we hypothesized that the main
35
+ source of error and uncertainty in DU would reside in the text representations
36
+ learned by deep neural networks (DNN)s. This is why we focused on the
37
+ more fundamental question of how well do PUQ methods scale in NLP? More
38
+ specifically, we restricted the scope to the prototypical, well-studied task of
39
+ text classification, for which we could leverage existing multi-domain datasets
40
+ varying in complexity, size and label space (multi-class vs. multi-label).
41
+ This leads to the following research questions:
42
+ RQ 1. When tested in realistic language data distributions on various text
43
+ classification tasks, how well do PUQ methods fare in NLP?
44
+
45
+
assets/txts/pg_0039.txt ADDED
@@ -0,0 +1,41 @@
1
+ PROBLEM STATEMENT AND QUESTIONS
2
+
3
+ 7
4
+
5
+ RQ 2. In which settings are PUQ methods most useful, i.e., which failure sources
6
+ / distribution shifts are they most sensitive to?
7
+ RQ 3. How can we obtain better PUQ estimates without overrelying on
8
+ computationally prohibitive methods, e.g., Deep Ensemble [238]?
9
+ RQ 4. How important are certain prior, neural architecture or hyperparameter
10
+ influences on the quality of PUQ estimation?
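As a concrete illustration of the kind of PUQ method RQ 3 refers to, here is a minimal sketch of Deep Ensemble-style uncertainty (illustrative numbers only; `mean_softmax` and `entropy` are hypothetical helpers, not the benchmarked implementation): the softmax outputs of independently trained members are averaged, and disagreement between members surfaces as higher entropy of the mean distribution.

```python
# Minimal Deep Ensemble sketch: average member softmax outputs and score
# uncertainty as the entropy of the averaged distribution. No real models
# are trained here; the member probabilities are made-up.
import math

def mean_softmax(member_probs):
    """Average per-class probabilities over ensemble members."""
    n = len(member_probs)
    return [sum(p[c] for p in member_probs) / n for c in range(len(member_probs[0]))]

def entropy(probs):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

members = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]  # disagreeing members
avg = mean_softmax(members)
print(avg)  # roughly [0.5, 0.33, 0.17]
print(entropy(avg) > entropy(members[0]))  # disagreement raises uncertainty
```

The computational cost RQ 3 alludes to is visible here: every member is a fully trained network, so the sketch multiplies training cost by the ensemble size.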
11
+ In a later chapter (Chapter 5), we introduce a complex benchmark for generic
12
+ DU that additionally tests for robustness to domain, visual and layout shifts,
13
+ and explores the novel problem of hallucination and control in natural language
14
+ generation (NLG) with LLMs from the perspective of calibrated and selective
15
+ DocVQA. The general task formulation involves a natural language question (on
16
+ content, aspect, form, visual/layout), an input document, and a set of reference
17
+ answers. The model is expected to provide a natural language answer, an answer
18
+ confidence and a (binary) abstention decision. Evaluation is done in terms of
19
+ answer correctness, calibration and selective prediction. On the one hand, one
20
+ expects a model to lower confidence when unsure about the correctness of a
21
+ predicted answer. On the other hand, one expects a model to abstain from
22
+ answering and refrain from hallucinations on unanswerable questions (which
23
+ had been explicitly added in the dataset).
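The selective-prediction side of this evaluation can be sketched as follows (field names and data are illustrative, not the DUDE evaluation code): each model output carries a correctness flag, a confidence, and an abstention decision; coverage is the fraction of questions answered, and risk is the error rate among the answered ones.

```python
# Hedged sketch of selective-prediction scoring for DocVQA-style outputs.
# A well-behaved model abstains on unanswerable questions, trading coverage
# for lower risk.

def coverage_risk(items):
    """Return (coverage, risk) over items with 'correct' and 'abstained' flags."""
    answered = [it for it in items if not it["abstained"]]
    coverage = len(answered) / len(items)
    risk = (sum(not it["correct"] for it in answered) / len(answered)) if answered else 0.0
    return coverage, risk

items = [
    {"correct": True,  "confidence": 0.9, "abstained": False},
    {"correct": False, "confidence": 0.4, "abstained": False},
    {"correct": False, "confidence": 0.2, "abstained": True},   # unanswerable; model abstains
    {"correct": True,  "confidence": 0.8, "abstained": False},
]
cov, risk = coverage_risk(items)
print(cov, risk)  # 0.75 of questions answered, with 1/3 of answers wrong
```

Sweeping an abstention threshold over the confidences yields the coverage-risk trade-off curve against which calibrated and selective models are compared.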
24
+ RQ 5. How severe is the problem of hallucination and control in LLMs when
25
+ evaluated in a selective, free-form DocVQA task setting?
26
+
27
+ 1.2.2
28
+
29
+ Realistic and Efficient Document Understanding
30
+
31
+ The second part of the dissertation focuses on the more applied research questions
32
+ of realistic and efficient DU. The overall objective is to make DU technology
33
+ more generically applicable (Chapter 5), evaluation more in sync with real-world
34
+ requirements (Chapters 4 and 5), and more efficient at modeling the multimodal
35
+ and compositional nature of documents (Chapters 5 and 6).
36
+ Due to the proximity to business applications and the risks of leaking personal
37
+ information, DU research benchmarks have diverged substantially from the
38
+ real-world distributions of document data. For instance, DU datasets are often
39
+ limited to single-page document images, are from outdated sources (e.g., IIT-
40
+
41
+
assets/txts/pg_0040.txt ADDED
@@ -0,0 +1,35 @@
1
+ 8
2
+
4
+
5
+ CDIP [252]), or are restricted to a single domain or a small set of document
6
+ types.
7
+ We posit that larger, fundamental questions in DU remain unanswered due to a
8
+ lack of sufficiently complex datasets and benchmarks with a rich methodology
9
+ covering evaluation beyond the independent and identically distributed (i.i.d.)
10
+ test set setting. While there exist performant models for DU subtasks such
11
+ as OCR, DC, KIE, etc., it is unclear how to move from these specific analysis
12
+ and recognition tasks to models that can reason and understand documents. A
13
+ truly end-to-end DU solution must handle the complexity and variety of real-world documents and subtasks, which could be expressed as natural language
14
+ questions. Moreover, it should be able to generalize to any question on any
15
+ document and reason over multiple pages and modalities.
16
+ The following research questions are addressed in Chapters 4 and 5:
17
+ RQ 6. How can we iteratively close the gap between research and practice in DU?
18
+ RQ 7. How can we design a resource that comprehensively challenges the state-of-the-art?
19
+ RQ 8. Which DU aspects are most challenging for current state-of-the-art LLMs?
20
+ How can these be incorporated in a benchmark to allow proper measurements
21
+ of future improvements?
22
+ However, moving the goalpost beyond a single-page context inevitably requires
23
+ us to reconsider the research challenge of efficiency in DU. The rise of LLMs
24
+ has enabled a new generation of DU pipelines, which are more flexible and
25
+ easier to maintain than separate and specialized subtask modules, but also
26
+ more computationally demanding. Importantly, most LLMs are not designed
27
+ to handle the multimodality and long context windows of multipage documents,
28
+ and are often unaware of the visual and layout semantics of documents.
29
+ The research questions for Chapter 6 address the efficiency challenge in DU:
30
+ RQ 9. How can we efficiently infuse LLMs with semantic layout awareness for
31
+ more focused information extraction?
32
+ RQ 10. To what degree can model compression resolve the problem of efficiency
33
+ in processing documents?
34
+
35
+
assets/txts/pg_0041.txt ADDED
@@ -0,0 +1,27 @@
+ OUTLINE
+
+ 9
+
+ 1.3 Outline
+
9
+ Figure 1.1. Overview of publications and how they relate to the chapters.
10
+
11
+ Figure 1.2. Visual Overview of the research questions and how they relate to the
12
+ chapters.
13
+
14
+ After the introductory Chapters 1 and 2, we continue with the publication-based
15
+ chapters that form the core of the thesis, which are structured in two parts.
16
+ Part I consists of a single chapter, Chapter 3, which presents a benchmarking
17
+ study of PUQ methods applied on real-world text classification datasets with
18
+ 1-D convolutional neural networks and pretrained transformers. It motivates
19
+ a novel PUQ method, Deep Ensemble with Concrete Dropout, combining the
20
+ benefits of both methods, and showing promise for improving reliability and
21
+ robustness in NLP at a lower computational cost. The chapter concludes with
22
+ a discussion of the results, including targeted ablation studies, and provides
23
+ recommendations for future research.
24
+ Part II consists of three chapters, Chapters 4 to 6, which all focus on the more
25
+ applied research questions of realistic and efficient DU.
26
+
27
+
assets/txts/pg_0042.txt ADDED
@@ -0,0 +1,31 @@
1
+ 10
2
+
4
+
5
+ Chapter 4 reflects on the current state of DU research, and proposes guidelines to
6
+ foster document dataset construction efforts. It introduces two novel document
7
+ classification datasets, RVL-CDIP_MP and RVL-CDIP-N_MP, as extensions
8
+ of the RVL-CDIP dataset [165] with multipage documents. The datasets are
9
+ accompanied by a comprehensive experimental analysis, which shows promise
10
+ from advancing multipage document representations and inference.
11
+ Chapter 5 introduces the multi-faceted DUDE
12
+ benchmark for assessing
13
+ generic DU, that was also hosted as a competition to challenge the DU
14
+ community. It describes the complete methodology and design of the dataset,
15
+ targeting model innovations that can handle the complexity and variety of
16
+ real-world documents and subtasks, and generalize to any documents and any
17
+ questions. Next to a discussion of the competition results, it also presents
18
+ our own comprehensive benchmarking study of SOTA LLMs with varying the
19
+ context length and what modalities are represented.
20
+ Chapter 6 investigates how to efficiently obtain more semantic document layout
21
+ awareness. We explore what affects the teacher-student knowledge gap in
22
+ KD-based model compression methods, and design a downstream task setup
23
+ to evaluate the robustness of distilled DLA models on zero-shot layout-aware
24
+ DocVQA.
25
+ Finally, Chapter 7 concludes the thesis with a summary of the main contributions
26
+ (Section 7.1), and a discussion of future research directions. As a logical followup to Chapter 5, we propose in Section 7.2.2.1 how the DUDE dataset could
27
+ be extended to become the ‘ultimate’ DU benchmark. The thesis ends with a
28
+ hypothetical, informed design of how the research presented would form part of
29
+ an end-to-end, fully-fledged IA-DU solution (Section 7.2.2.2).
30
+
31
+
assets/txts/pg_0043.txt ADDED
@@ -0,0 +1,32 @@
+ Chapter 2
+
+ Fundamentals
+
+ This chapter provides the background knowledge necessary to understand the
+ contributions of this thesis.
+ The key questions covered here are:
+
+ i. How to feed a document to an algorithm to perform arbitrary tasks on it?
+ ii. How to model language, vision, layout or structure?
+ iii. How does it learn and then operate at inference time?
+ iv. How does it estimate prediction uncertainty?
+ v. How to evaluate its performance?
+ vi. How to integrate it as a useful, end-to-end system in a document workflow?
+
+ Section 2.1 explains the basic setting from the perspective of statistical learning
+ theory [472], which is a mathematical framework for analyzing how algorithms
+ learn from data with minimal error. Section 2.2 gives a primer on reliability and
+ robustness, particularly calibration, failure detection and relevant evaluation
+ metrics. Section 2.3 surveys the DU field, and discusses the state of the art in
+ DU technology. Finally, Section 2.4 covers Intelligent Automation to illustrate
+ how solving the challenges posed in this thesis will enable augmenting human
+ intelligence, creativity and productivity in straight-through business processes.
+
+ 11
+
assets/txts/pg_0044.txt ADDED
@@ -0,0 +1,163 @@
+ 12
+
+ FUNDAMENTALS
+
+ Contents
+ 2.1   Statistical Learning - basics . . . . . . . . . . . . . . . .  12
+       2.1.1  Neural Networks  . . . . . . . . . . . . . . . . . . .  14
+       2.1.2  Probabilistic Evaluation . . . . . . . . . . . . . . .  15
+       2.1.3  Architectures  . . . . . . . . . . . . . . . . . . . .  17
+ 2.2   Reliability and Robustness  . . . . . . . . . . . . . . . . .  18
+       2.2.1  Generalization and Adaptation  . . . . . . . . . . . .  19
+       2.2.2  Confidence Estimation  . . . . . . . . . . . . . . . .  20
+       2.2.3  Evaluation Metrics . . . . . . . . . . . . . . . . . .  21
+       2.2.4  Calibration  . . . . . . . . . . . . . . . . . . . . .  25
+       2.2.5  Predictive Uncertainty Quantification  . . . . . . . .  27
+       2.2.6  Failure Prediction . . . . . . . . . . . . . . . . . .  29
+ 2.3   Document Understanding  . . . . . . . . . . . . . . . . . . .  30
+       2.3.1  Task Definitions . . . . . . . . . . . . . . . . . . .  31
+       2.3.2  Datasets . . . . . . . . . . . . . . . . . . . . . . .  33
+       2.3.3  Models . . . . . . . . . . . . . . . . . . . . . . . .  34
+       2.3.4  Challenges in Document Understanding . . . . . . . . .  35
+ 2.4   Intelligent Automation  . . . . . . . . . . . . . . . . . . .  38
+
+ 2.1  Statistical Learning
+
+ Two popular definitions of Machine Learning (ML) are given below.
+
+   Machine Learning is the field of study that gives computers the ability
+   to learn without being explicitly programmed. [406]
+
+   A computer program is said to learn from experience E with respect to
+   some class of tasks T, and performance measure P, if its performance
+   at tasks in T, as measured by P, improves with experience E. [317]
+
+ Following these, different types of learning problems [472] can be discerned, of
+ which the most common (and the one used throughout our works) is supervised
+ learning. It defines experience E as a set of input-output pairs, for which the
+ task T is to learn a mapping f from inputs X ∈ X to outputs Y ∈ Y, and the
+ performance measure P is the risk or expected loss (Equation (2.1)), given a
+ (0-1) loss function ℓ : Y × Y → R+.
+
+   R(f) = E_{(X,Y)∼P}[ℓ(Y, f(X))]                               (2.1)
+
+ The mapping f(·; θ) : X → Y is typically parameterized by a set of parameters
+ θ (omitted whenever it is fixed) and a hypothesis class F, which is a set of
+
assets/txts/pg_0045.txt ADDED
@@ -0,0 +1,53 @@
+ STATISTICAL LEARNING
+
+ 13
+
+ possible functions. The objective is to find a function f ∈ F that minimizes the
+ risk, or even better, the Bayes risk
+
+   f* = inf_{f ∈ F} R(f),                                       (2.2)
+
+ which is the minimum achievable risk over all functions in F. The latter is only
+ realizable with infinite data or having access to the data-generating distribution
+ P(X, Y). In practice, Equation (2.2) is unknown, and the goal is to find a
+ function f̂ that minimizes the empirical risk
+
+   f̂ = argmin_{f ∈ F} (1/N) Σ_{i=1}^{N} ℓ(y_i, f(x_i)),        (2.3)
+
+ where (x_i, y_i) are N independently and identically distributed (i.i.d.) samples
+ drawn from an unknown distribution P on X × Y. This is known as empirical
+ risk minimization (ERM), which is a popular approach to supervised learning,
+ under which three important processes are defined.
+ Training or model fitting is the process of estimating the parameters θ of a
+ model, which is done by minimizing a suitable loss function ℓ over a training
+ set D = {(x_i, y_i)}_{i=1}^{N} of N i.i.d. samples.
+ Inference or prediction is the process of estimating the output of a model for
+ a given input, which is typically done by computing the posterior probability
+ P(y|x) over the output space Y. Classification output is a discrete label, while
+ regression output is a continuous value.
+ Evaluation involves measuring the quality of a model’s predictions, which is
+ typically done by computing a suitable evaluation metric over a test set D_test
+ of i.i.d. samples, which were not used for training.
+ However, ERM has its caveats concerning generalization to unseen data,
+ requiring either additional assumptions on the hypothesis class F, which
+ are known as inductive biases, and/or regularization to penalize the
+ complexity of the function class F [445]. In neural networks (discussed in
+ detail in Section 2.1.1), the former is controlled by the architecture of the
+ network, while the latter involves specifying constraints on parameters or
+ adding a regularization term to the loss function.
+
+   f̂ = argmin_{f ∈ F} ( R̂(f) + λΨ(θ) )                        (2.4)
+
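The ERM objective of Equation (2.3) can be sketched in a few lines. This is an illustrative toy, not code from the thesis: the hypothesis class F is a finite set of threshold classifiers f_t(x) = 1[x >= t], and ERM picks the member with the lowest average 0-1 loss on the training set.

```python
# Empirical risk minimization (Eq. 2.3) over a toy hypothesis class of
# threshold classifiers. All data and thresholds are made up for illustration.

def zero_one_loss(y_true, y_pred):
    return 0.0 if y_true == y_pred else 1.0

def empirical_risk(f, data):
    # average 0-1 loss over the training set
    return sum(zero_one_loss(y, f(x)) for x, y in data) / len(data)

def erm(thresholds, data):
    # argmin over the (finite) hypothesis class
    return min(thresholds,
               key=lambda t: empirical_risk(lambda x: int(x >= t), data))

data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
t_hat = erm([0.0, 0.25, 0.5, 0.75, 1.0], data)
f_hat = lambda x: int(x >= t_hat)
print(t_hat, empirical_risk(f_hat, data))  # 0.5 0.0
```

Adding a penalty term λΨ(t) to the quantity minimized inside `erm` would turn this into the regularized objective of Equation (2.4).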
assets/txts/pg_0046.txt ADDED
@@ -0,0 +1,47 @@
+ 14
+
+ FUNDAMENTALS
+
+ Equation (2.4) defines regularized empirical risk minimization (RERM),
+ where Ψ(θ) is a regularization term and λ is a hyperparameter that controls the
+ trade-off between the empirical risk (denoted with R̂) and the regularization
+ term.
+ All these concepts will be revisited in the context of neural networks in
+ Section 2.1.1, where we will also discuss the optimization process of the model
+ parameters θ, how inference differs in the case of probabilistic models to estimate
+ uncertainty (Section 2.2.5), and how regularization affects confidence estimation
+ and calibration (Section 2.2.4).
+
+ 2.1.1  Neural Networks
+
+ An artificial neural network (NN) is a mathematical approximation inspired
+ by data processing in the human brain [396]. It can be represented by a
+ network topology of interconnected neurons, organized in layers that
+ successively refine intermediately learned feature representations of the input
+ [448] that are useful for the task at hand, e.g., classifying an animal by means
+ of its size, shape and fur, or detecting the sentiment of a review by focusing on
+ adjectives.
+ A basic NN building block is a linear layer, which is a linear function of the
+ input parameters: f(x) = Wx + b, where the bias term b is a constant vector
+ shifting the decision boundary away from the origin, and the weight matrix
+ W holds most parameters, which rotate the decision boundary in input space.
+ Activation functions (e.g., tanh, ReLU, sigmoid, softmax, GELU) are used to
+ introduce non-linearity in the model, which is required for learning complex
+ functions.
+ The first deep learning (DL) network (stacking multiple linear layers) dates
+ back to 1965 [191], yet the term ‘Deep Learning’ was coined in 1986 [398].
+ The first successful DL application was a demonstration of digit recognition
+ in 1998 [244], followed by DL for CV [90, 223] and NLP [76]. The recent
+ success of DL is attributed to the availability of large datasets, the increase in
+ computational power, the development of new algorithms and architectures,
+ and the commercial interest of large companies.
+ Consider a conventional DL architecture as a composition of parameterized
+ functions. Each consists of a configuration of layers (e.g., convolution, pooling,
+ activation function, normalization, embeddings) determining the type of input
+ transformation (e.g., convolutional, recurrent, attention) with (trainable)
+ parameters linear/non-linear w.r.t. the input x. Given the type of input,
+ e.g., language which is naturally discrete-sequential, or vision which presents a
+
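A linear layer f(x) = Wx + b followed by a non-linearity can be sketched as below. This is a minimal pure-Python illustration with hand-set weights (in practice W and b are trained), not code from the thesis.

```python
# A linear layer f(x) = Wx + b followed by a tanh activation.
import math

def linear(W, b, x):
    # W: list of rows (output_dim x input_dim), b: bias vector
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def tanh_layer(W, b, x):
    # non-linearity applied element-wise to the linear output
    return [math.tanh(z) for z in linear(W, b, x)]

W = [[1.0, -1.0], [0.5, 0.5]]  # illustrative, untrained weights
b = [0.0, 1.0]
print(tanh_layer(W, b, [2.0, 1.0]))
```

Stacking several such layers (each with its own W, b and activation) yields the deep architectures discussed above.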
assets/txts/pg_0047.txt ADDED
@@ -0,0 +1,53 @@
+ STATISTICAL LEARNING
+
+ 15
+
+   Sigmoid Function:  σ(z) = 1 / (1 + exp(−z))
+   Softmax Function:  softmax(z)_k = exp(z_k) / Σ_{k'=1}^{K} exp(z_{k'})
+
+ Table 2.1. Sigmoid and softmax activation functions for binary and multi-class
+ classification, respectively.
+
+ ready continuous-spatial signal, different DL architectures have been established,
+ which will be discussed in Section 2.1.3.
+ A K-class classification function with an l-layer NN with d-dimensional input
+ x ∈ R^d is shorthand f_θ : R^d → R^K, with θ = {θ_j}_{j=1}^{l} assumed to be
+ optimized, either partially or fully, using backpropagation and a loss function.
+ More specifically, it presents a non-convex optimization problem, concerning
+ multiple feasible regions with multiple locally optimal points within each. With
+ maximum-likelihood estimation, the goal is to find the optimal parameters
+ or weights that minimize the loss function, effectively interpolating the training
+ data. This process involves traversing the high-dimensional loss landscape.
+ Upon convergence of model training, the optimized parameters form a solution
+ in the weight-space, representing a unique mode (specific function f_θ̂). However,
+ when regularization techniques such as weight decay, dropout, or early stopping
+ are applied, the objective shifts towards maximum-a-posteriori (MAP) estimation,
+ to take into account the prior probability of the parameters. The difference in
+ parameter estimation forms the basis for several uncertainty estimation methods,
+ covered in Section 2.2.5.
+ A prediction is a translation of a model’s output to which a standard decision
+ rule is applied, e.g., to obtain the top-1/k prediction (Equation (2.5)), or decode
+ structured output according to a function maximizing total likelihood with
+ optionally additional diversity criteria.
+
+   ŷ = argmax f_θ̂(x)                                           (2.5)
+
+ Considering standard NNs, the last layer outputs a vector of real-valued logits
+ z ∈ R^K, which in turn are normalized to a probability distribution over K
+ classes using a sigmoid or softmax function (Table 2.1).
+
+ 2.1.2  Probabilistic Evaluation
+
+ The majority of our works involve supervised learning with NNs, formulated
+ generically as a probabilistic predictor in Definition 1.
+
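The activations of Table 2.1 and the top-1 decision rule of Equation (2.5) can be sketched as follows; the logit values are made up for illustration.

```python
# Softmax over logits (Table 2.1) and the argmax decision rule (Eq. 2.5).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                                 # K = 3 classes
probs = softmax(logits)                                  # sums to 1
y_hat = max(range(len(probs)), key=lambda k: probs[k])   # top-1 prediction
print(probs, y_hat)  # highest logit wins: y_hat = 0
```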
assets/txts/pg_0048.txt ADDED
@@ -0,0 +1,45 @@
+ 16
+
+ FUNDAMENTALS
+
+ Definition 1. Probabilistic predictor f : X → ∆_Y that outputs a conditional
+ probability distribution P(y′|x) over outputs y′ ∈ Y for an i.i.d. drawn sample
+ (x, y).
+
+ Definition 2 (Probability Simplex). Let ∆_Y := {v ∈ R_{≥0}^{|Y|} : ‖v‖_1 = 1}
+ be a probability simplex of size |Y| − 1 as a geometric representation of a
+ probability space, where each vertex represents a mutually exclusive label and
+ each point has an associated probability vector v [368].
+ Figure 2.1 illustrates a multi-class classifier, where Y = [K] for K = 3 classes.
+
+ Figure 2.1. Scatter plot of a ternary problem (K = 3, N = 100) in the probability
+ simplex space. Example of an overconfident misprediction (above is a Shiba Inu
+ dog) and a correct sharp prediction (clear image of a Beagle).
+
+ In practice, loss functions are proper scoring rules [330], S : ∆_Y × Y → R, that
+ measure the quality of a probabilistic prediction P(ŷ|x) given the true label y.
+ The cross-entropy (CE) loss is a popular loss function for classification, while
+ the mean-squared error (MSE) loss is used for regression. In Section 2.2, we
+ will discuss the evaluation of probabilistic predictors in more detail, including
+ the calibration of confidence estimates and the detection of out-of-distribution
+ samples.
+
+ 2.1.3  Architectures
+
+ Throughout the chapters of the thesis, we have primarily used the following
+ NN architectures: Convolutional Neural Networks (CNNs) and Transformer
+ Networks. We will briefly introduce the building blocks of these architectures,
+ with a focus on how they are used in the context of document understanding.
+
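Definition 2 and the cross-entropy scoring rule can be made concrete with a short check. This is an illustrative sketch (probability values invented here): a prediction is valid if it lies on the simplex ∆_Y, and cross-entropy rewards sharp, correct predictions.

```python
# Membership test for the probability simplex (Definition 2) and the
# cross-entropy proper scoring rule for a single prediction.
import math

def on_simplex(v, tol=1e-9):
    # non-negative entries whose L1 norm is 1
    return all(p >= 0 for p in v) and abs(sum(v) - 1.0) < tol

def cross_entropy(probs, true_label):
    # negative log-likelihood of the true class
    return -math.log(probs[true_label])

p = [0.7, 0.2, 0.1]      # K = 3, a point in the simplex
assert on_simplex(p)
print(cross_entropy(p, 0))  # confident and correct: low loss
print(cross_entropy(p, 2))  # confident but wrong: higher loss
```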
assets/txts/pg_0049.txt ADDED
@@ -0,0 +1,41 @@
+ STATISTICAL LEARNING
+
+ 17
+
+ 2.1.3.1  Convolutional Neural Networks
+
+ Convolutional Neural Networks (CNNs) [244] are a class of DNNs designed
+ primarily for visual and grid-spatial data such as images. They are inspired by
+ the visual cortex of animals, which contains neurons that are sensitive to small
+ subregions of the visual field, called receptive fields. The receptive fields of
+ different neurons partially overlap such that they cover the entire visual field,
+ growing larger in deeper layers of the visual cortex.
+
+ Figure 2.2. Sketch of a CNN architecture. The input is a 2D image, which is iteratively
+ convolved with a set of learned filters detecting specific input features, e.g., edges,
+ corners, blobs, to produce feature maps. Feature maps are then downsampled using
+ a pooling operation.
+
+ As illustrated in Figure 2.2, CNNs are composed of multiple convolutional layers,
+ which hierarchically extract features from the input, followed by pooling and
+ fully-connected layers to classify the input based on the downsampled features.
+ A filter K ∈ R^{d×d} is a rectangular matrix of trainable weights with width and
+ height d typically smaller than the input x. A convolutional layer applies filters
+ sliding over the input, with each filter producing a feature map:
+
+   F = K ∗ x,                                                   (2.6)
+
+ where the convolution operation ∗ computes a dot product between filter entries
+ and the covered portions of the input.
+ Thanks to the weight-sharing property of the convolution operation, CNNs are
+ able to learn translation invariance, i.e., the ability to recognize an object
+ regardless of its position in the image. This is particularly useful for object
+ detection, where the position of the object in the image is unknown.
+ This architecture was used for document image classification and document
+ layout analysis (Section 6.3.2). A special version is the 1-D CNN, which we
+ applied to one-hot encoded text data in text classification benchmarking
+ (Section 3.4.3).
+
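Equation (2.6) can be sketched as a single 2-D convolution in pure Python. This is a minimal illustration with a hand-set filter (valid padding, stride 1); in a real CNN the filter weights are learned.

```python
# One 2-D convolution F = K * x (Eq. 2.6): slide a d x d filter over the
# input and take the dot product with each covered patch.

def conv2d(x, k):
    d = len(k)                       # filter is d x d
    h, w = len(x), len(x[0])
    out = []
    for i in range(h - d + 1):
        row = []
        for j in range(w - d + 1):
            row.append(sum(k[a][b] * x[i + a][j + b]
                           for a in range(d) for b in range(d)))
        out.append(row)
    return out

image = [[1, 2, 0],
         [0, 1, 3],
         [4, 0, 1]]
edge_filter = [[1, -1],
               [1, -1]]              # crude vertical-edge detector
print(conv2d(image, edge_filter))   # [[-2, 0], [3, -3]]
```

Because the same filter weights are reused at every position, a feature is detected wherever it occurs, which is the weight-sharing behind translation invariance.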
assets/txts/pg_0050.txt ADDED
@@ -0,0 +1,46 @@
+ 18
+
+ FUNDAMENTALS
+
+ 2.1.3.2  Language Neural Networks
+
+ The first step to represent language input in a format compatible with NNs is
+ to convert units of language (words, characters, or “tokens”, depending on the
+ tokenizer) into numerical vectors. This is done by means of embeddings,
+ which are typically learned as part of the training process, and are used to
+ represent the meaning of words in a continuous vector space. There have been
+ multiple generations of word embeddings, starting with one-hot vectors that
+ represent each word by a vector of zeros with a single one at its vocabulary index,
+ which depends highly on the tokenizer used and does not capture semantic
+ relationships between words. Alternatives are frequency-based embeddings,
+ such as TF-IDF vectors, which represent each word by its frequency in a
+ document, weighted by its inverse document frequency in the corpus, capturing
+ some lexical semantics, but not the context in which the word appears. The next
+ generation are Word2Vec embeddings, which are trained to predict the context
+ of a word, i.e., the words that appear before and after it in a sentence. FastText
+ embeddings improve this by considering a character n-gram context, i.e., a
+ sequence of n characters. The current generation are contextual word
+ embeddings, which take into account the surrounding context and learn the
+ sense of a word based on its context, e.g., ‘bank’ as a river bank vs. a financial
+ institution in ‘Feliz sits at the bank of the river Nete’. Another important
+ innovation is subword tokenization to deal with the out-of-vocabulary
+ (OOV) problem, which is particularly relevant for morphologically rich
+ languages, such as Dutch, where word meaning can be inferred from its subwords.
+ A clever extension is byte pair encoding (BPE) [412], which is a data
+ compression algorithm that iteratively replaces the most frequent pair of bytes
+ in a sequence with a single, unused byte, until a predefined vocabulary size is
+ reached. This is particularly useful for multilingual models, where the vocabulary
+ size would otherwise be too large to fit in memory.
+ The first embedding layer is typically a lookup table, which maps each word
+ to a unique index in a vocabulary, and each index to a vector of real numbers.
+ The embedding layer is typically followed by a recurrent, convolutional or
+ attention layer, which is used to capture the sequential nature of language.
+ Recurrent Neural Networks (RNNs) and recurrent architectures extended
+ to model long-range dependencies, such as Long Short-Term Memory (LSTM)
+ and Gated Recurrent Unit (GRU) networks, were the dominant architectures
+ for sequence modeling in NLP, yet they have been superseded by Transformers
+ in recent years.
+
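One merge step of BPE, as described above, can be sketched in a few lines. This is an illustrative toy on characters (not the thesis' implementation): find the most frequent adjacent pair of symbols and fuse it into a new vocabulary symbol; repeating this grows the subword vocabulary to the desired size.

```python
# A single byte pair encoding (BPE) merge step on a toy character sequence.
from collections import Counter

def most_frequent_pair(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair):
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])  # fuse into one symbol
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("lowlowlowerlowest")
pair = most_frequent_pair(tokens)   # ('l', 'o') is (tied) most frequent
tokens = merge(tokens, pair)
print(pair, tokens[:4])             # ('l', 'o') ['lo', 'w', 'lo', 'w']
```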
assets/txts/pg_0051.txt ADDED
@@ -0,0 +1,58 @@
+ STATISTICAL LEARNING
+
+ 19
+
+ 2.1.3.3  Transformer Network
+
+ A Transformer [473] is a sequence-to-sequence model that uses an attention
+ mechanism to capture long-range dependencies in the input sequence, benefiting
+ from increased parallelization. Traditionally, it consists of an encoder and a
+ decoder, each composed of multiple layers of self-attention and feed-forward
+ layers.
+ Attention is a mechanism that allows for soft selection of relevant information
+ from a set of candidates, e.g., tokens in a document, based on a query, e.g.,
+ a token in the document. The scaled dot-product attention is defined
+ for a sequence of length n as follows: Att(Q, K, V) = Σ_{i=1}^{n} α_i V_i. It
+ utilizes three learnable weight matrices, each multiplied with all token
+ embeddings in a sequence to build queries Q ∈ R^{n×d_q}, keys K ∈ R^{n×d_q},
+ and values V ∈ R^{n×d_v}. The output of the attention mechanism is a weighted
+ sum of the unnormalized values, where each attention weight of the i-th key is
+ computed by normalizing the dot product between the query and key vectors:
+
+   α_i = exp(Q_i^T K_i) / Σ_{j=1}^{n} exp(Q_j^T K_j)
+
+ For training stability, the dot product is typically scaled by the square root of
+ the dimensionality of the query and key vectors. This is followed by a feed-forward
+ layer to capture non-linear relationships between the tokens in the sequence.
+ There exist different forms of attention, depending on the type of relationship
+ that is captured. Self-attention computes the attention of each token w.r.t.
+ all other tokens in the sequence, which changes the representation of each token
+ based on the other tokens in the sequence. Multi-head attention is a set
+ of h attention layers, which every Transformer uses to concurrently capture
+ different types of relationships, concatenated together after the parallelized
+ processing. Cross-attention computes the attention of each token in one
+ sequence w.r.t. all tokens in another sequence, which is used in encoder-decoder
+ Transformer architectures for, e.g., summarization and machine translation.
+ Specific to decoder layers, masked attention is used to prevent the decoder
+ from attending to future tokens in the sequence by masking the upper triangle
+ of the attention matrix calculation.
+ A major downside to Transformers is the quadratic complexity of the attention
+ mechanism (Figure 2.3), which makes them computationally inefficient for long
+ sequences. This has been addressed by a wealth of techniques [120], such as
+ sparsifying attention, targeting recurrence, downsampling, and random or
+ low-rank approximations.
+ Position Embeddings are indispensable for Transformers to be able to process
+ sequences, as they do not have any notion of order or position of tokens in
+ a sequence. The most common type of position embedding is a sinusoidal
+
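The scaled dot-product attention above can be sketched for a single query. This is an illustrative pure-Python version with invented vectors; real implementations compute it for all queries at once as matrix products.

```python
# Scaled dot-product attention for one query: softmax over scaled
# query-key scores, then a weighted sum of the value vectors.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(q, keys, values):
    d_q = len(q)
    scores = [dot(q, k) / math.sqrt(d_q) for k in keys]   # scaled by sqrt(d_q)
    m = max(scores)                                       # stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    alphas = [w / total for w in weights]                 # attention weights
    # weighted sum of the value vectors
    return [sum(a * v[i] for a, v in zip(alphas, values))
            for i in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]      # the first key matches the query
values = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, keys, values)
print(out)  # pulled toward values[0], since key 0 scores highest
```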
assets/txts/pg_0052.txt ADDED
@@ -0,0 +1,38 @@
+ 20
+
+ FUNDAMENTALS
+
+ Figure 2.3. Illustration of the main attention mechanisms in a Transformer
+ (quadratic complexity).
+
+ embedding with a fixed frequency and phase, f(x) = sin(ωx + φ), where ω is the
+ frequency and φ is the phase, which are learned as part of the training process,
+ and they are typically shared across all tokens in the sequence. Integrating
+ position information into Transformers can be achieved in different ways, for
+ which [105, Table 1] gives an overview.
+ Transformers have gradually taken over as an end-to-end architecture for both
+ NLP and CV tasks, albeit adoption in CV has been slower, due to the lack
+ of spatial invariance in the original Transformer architecture. This has been
+ addressed by recent works, such as the Vision Transformer (ViT) [101], which
+ uses a patch-based input representation with position embeddings.
+ A large language model (LLM) consists of a stack of Transformers that is
+ pretrained on a large corpus of text, typically using a self-supervised learning
+ objective, such as predicting the next token in a sequence. The goal of LLMs
+ is to learn a general-purpose language representation that can be fine-tuned
+ to perform well on a wide range of downstream tasks. LLMs have disrupted
+ NLP in recent years, as they have achieved SOTA performance on a wide
+ range of tasks thanks to pretraining on large amounts of data. The most
+ popular LLMs are BERT [95], RoBERTa [287], ELECTRA [73], T5 [383],
+ GPT-3 [52], Llama-2 [452], and Mistral [199]. Next to challenges specific to
+ modeling document inputs, explained in Section 2.3.4, open challenges for
+ LLMs include: (i) structured output generation, (ii) domain-specific knowledge
+ injection (e.g., does retrieval-augmented generation (RAG) suffice? [253, 347]),
+ (iii) multimodality.
+ Vision-language models (VLMs) are a recent development in multimodal
+ learning, which combine the power of LLMs with vision encoders to perform
+ tasks that require understanding both visual and textual information. The most
+ popular VLMs are CLIP [381], UNITER [70], FLAVA [423] and GPT-4 [344].
+ In every chapter of this dissertation we have used Transformers, either as part
+
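A sinusoidal position embedding of the kind discussed above can be sketched as follows. This assumes the fixed sin/cos form of the original Transformer with geometrically spaced frequencies (the text only gives the general form f(x) = sin(ωx + φ)), so treat it as an illustrative variant rather than the thesis' exact definition.

```python
# Sinusoidal position embedding: each position p gets sin/cos features
# at geometrically spaced frequencies, so every position has a unique code.
import math

def position_embedding(p, d_model):
    pe = []
    for i in range(0, d_model, 2):
        omega = 1.0 / (10000 ** (i / d_model))  # frequency for this pair
        pe.append(math.sin(omega * p))
        pe.append(math.cos(omega * p))
    return pe[:d_model]

pe0 = position_embedding(0, 4)
pe1 = position_embedding(1, 4)
print(pe0)  # position 0 -> [0.0, 1.0, 0.0, 1.0]
print(pe1)  # a different vector, distinguishing position 1 from 0
```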
assets/txts/pg_0053.txt ADDED
@@ -0,0 +1,46 @@
+ RELIABILITY AND ROBUSTNESS
+
+ 21
+
+ of a foundation model for DU tasks (Chapters 4 to 6) or to contrast with 1-D
+ CNNs in text classification (Chapter 3). Note that [265] share our concerns that
+ NLP needs a new ‘playground’ with more realistic tasks and benchmarks, which
+ extend beyond sentence-level contexts to more complex document-level tasks.
+ Alternative sub-quadratic architectures have started addressing the Transformer's
+ computational inefficiency on long sequences, e.g., Mamba [152] and LongNet
+ [99]. Time will tell if these will be able to compete with the Transformer’s
+ dominance in foundation models.
+
+ 2.2  Reliability and Robustness
+
+ Chapter 3 contains much relevant content on the basic relation between
+ uncertainty quantification, calibration, and distributional generalization or
+ detection tasks. Here, we will focus on the more general concepts of reliability
+ and robustness, and how they relate to concepts used throughout the rest of
+ the thesis. Next, we discuss the need for confidence estimation and appropriate
+ evaluation metrics, followed by short summaries of the main research trends in
+ calibration and uncertainty quantification.
+ Emerging guidance and regulations [2, 3, 475] place increasing importance on
+ the reliability and robustness of ML systems, particularly once they are used
+ in the public sphere or in safety-critical applications. In ML, reliability and
+ robustness are often used interchangeably [78, 420, 455], yet they are distinct
+ concepts, and it is important to understand the difference between them. This
+ thesis uses the following definitions of reliability and robustness, adapted from
+ the systems engineering literature [395]:
+ Definition 3 [Reliability]. Reliability is the ability of a system to consistently
+ perform its intended function in a specific, known environment for a specific
+ period of time, with a specific level of expected accuracy [395]. Closer to the ML
+ context, this entails all evaluation under the i.i.d. assumption, allowing for some
+ benign shifts of the distribution, including predictive performance evaluation
+ with task-dependent metrics (accuracy, F1, perplexity, etc.), calibration, selective
+ prediction, uncertainty estimation, etc.
+ Reliability requires clearly specifying the role an ML component plays in a
+ larger system, and defining the expected behavior of the system as a function
+ of alignment with the training data distribution. This is particularly important
+ in the context of black-box models, where the inner workings of the model are
+ not transparent to the user. In this case, the user needs to be aware of the
+ model’s limitations, e.g., model misspecification, lack of training data, and the
+
assets/txts/pg_0054.txt ADDED
@@ -0,0 +1,45 @@
+ 22
+
+ FUNDAMENTALS
+
+ model needs to be able to communicate its own uncertainty to the user. This is
+ the focus of Chapter 3.
+ Definition 4 [Robustness]. Robustness is the ability of a system to maintain
+ its intended function despite a wide range of disturbances, with a minimal
+ degradation of performance [395]. Such disturbances can take the form of
+ adversarial attacks, distributional shifts, or other types of noise. In the ML
+ context, this entails all evaluation violating the i.i.d. assumption, including
+ adversarial and label noise robustness, out-of-distribution detection, domain
+ generalization, extrapolation, etc.
+ Robustness is more involved with the application scope in which a model can
+ perform well, assuming that the model can maintain some degree of its prediction
+ capacity on non-i.i.d. data, which might be unknown at training time. Detecting
+ when the model is operating outside of its intended scope is an important part
+ of robustness to prevent failure propagation to downstream systems.
+ Resilience is another component of the R3 (reliability, robustness, resilience)
+ concept in systems engineering, yet it is not a focus of this thesis, nor is it
+ a relevant qualifier of the ML model in isolation, as it is more related to the
+ system as a whole. Resilient systems are able to recover from disturbances, even
+ those caused by model misspecification, e.g., by adapting to new environments
+ and unexpected inputs from unknown distributions or by self-healing.
+
+ 2.2.1  Generalization and Adaptation
+
+ To complete the R3 picture, we cannot overlook the generalization-adaptation
+ spectrum, which has been less explored in our works, yet is an important part
+ of current practices in ML.
+ Definition 5 [Generalization-adaptation]. Generalization is the ability of
+ a system to perform its intended function in a wide range of environments,
+ including those not known at design time [395]. Each environment is defined by
+ a data distribution over a domain and a task, and generalization is the ability
+ of a model to perform well on new data drawn from the same distribution.
+ Adaptation is the ability of a system to perform its intended function in a specific,
+ known environment, despite changes in the system itself or its environment
+ [395]. This entails the ability of a model to perform well on new data drawn
+ from a different distribution, which is known at design time.
+ Different settings of generalization-adaptation are: in-distribution (same
+ domain and task), domain generalization (same task, different domain), task
+ generalization (same domain, different task), out-of-distribution (different
+
assets/txts/pg_0055.txt ADDED
@@ -0,0 +1,45 @@
+ RELIABILITY AND ROBUSTNESS
+
+ 23
+
+ domain or task). If the model has access to limited samples for training
+ on the new distribution, it is referred to as few-shot learning (or, with no
+ samples at all, zero-shot learning); if it is able to adapt to new distributions
+ over time, or accumulate knowledge over different tasks without retraining from
+ scratch [87], it is referred to as continual learning or incremental learning.
+ Many of these settings are referred to in business as out-of-the-box or self-learning,
+ yet without any formal definitions given. Domain and task generalization are
+ major selling points of pretrained LLMs, which are able to perform well on a
+ wide range of tasks and domains. In the case of very different distributions, e.g.,
+ a different task/expected output or an additional domain/input modality, it is
+ often necessary to fine-tune the model on a small amount of data from the new
+ distribution, which is known as transfer learning. Specific to LLMs, instruction
+ tuning is a form of transfer learning, where samples from a new distribution are
+ appended with natural language instructions [69, 532]. This approach has been
+ used in Chapter 5 to adapt pretrained LLMs to the task of DocVQA, in an
+ effort to reduce the amount of annotated data required to generalize to unseen
+ domains and questions.
+
+ 2.2.2 Confidence Estimation
+
+ A quintessential component of reliability and robustness requires a model to
+ estimate its own uncertainty, or inversely, to translate model outputs into
+ probabilities or ‘confidence’ (Definition 6).
+ Definition 6 [Confidence Scoring Function]. Any function g : X → R
+ whose continuous output aims to separate a model’s failures from correct
+ predictions can be interpreted as a confidence scoring function (CSF) [193].
+ Note that while it is preferable to have the output domain of g ∈ [0, 1] for easier
+ thresholding, this is not a strict requirement.
+ Circling back to the question of why one needs a CSF, there are multiple reasons:
+ i) ML models are continually improving, yet zero test error is an illusion; even a
+ toy dataset (MNIST) is not perfectly separable; ii) once a model is deployed,
+ performance deterioration is expected as i.i.d. assumptions break; iii)
+ generative models are prone to hallucinations [198], requiring some control
+ mechanisms and guardrails to guide them.
+ Below, we present some common CSFs used in practice [114, 172, 194, 539],
+ where for convenience the subscript is reused to denote the k-th element of the
+ output vector: g(x) = g_k(x).
+
assets/txts/pg_0056.txt ADDED
@@ -0,0 +1,39 @@
+ I. Maximum softmax probability (MSP): g(x) = max_{y′∈Y} f_{y′}(x)
+ II. Maximum logit: g(x) = max_{y′∈Y} z_{y′}(x), with logits z ∈ R^K
+ III. Negative entropy: g(x) = −H(f(x)) = Σ_{y′∈Y} f_{y′}(x) log f_{y′}(x)
+ IV. Margin: g(x) = max_{y′∈Y} f_{y′}(x) − max_{y″∈Y∖{y′}} f_{y″}(x)
+ V. Distance-based measures
+ • kNN distance: a 1D outlier score derived from the average distance
+ of the feature representation of x to its k nearest neighbors in the
+ training distribution
+ • Mahalanobis distance [390]: the minimum distance of the feature
+ map (e.g., penultimate layer activations) of a test input to class-conditional
+ Gaussian distributions of the training data.
+ VI. Bayesian uncertainty estimation
+ Chapter 3 used MSP and negative entropy as CSFs, next to various PUQ
+ methods for Bayesian uncertainty estimation. Other chapters used MSP as it
+ is the most common CSF in practice, requiring only logits as input. From the
+ use of CSFs also follows the need to evaluate their statistical quality next to
+ task-specific predictive performance metrics, which is discussed next.
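As a concrete illustration of CSFs I–IV, a minimal sketch deriving all four scores from a single logit vector (pure Python; the function names are ours, not the thesis implementation):

```python
import math

def softmax(z):
    """Numerically stable softmax for a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def csf_scores(z):
    """Four logit-based confidence scores for a single logit vector z.
    Higher values should indicate a more confident (likely correct) prediction."""
    f = softmax(z)
    top = sorted(f, reverse=True)
    return {
        "msp": top[0],                                   # I. maximum softmax probability
        "max_logit": max(z),                             # II. maximum logit
        "neg_entropy": sum(p * math.log(p) for p in f),  # III. -H(f), higher = more peaked
        "margin": top[0] - top[1],                       # IV. top-1 minus runner-up probability
    }
```

On a peaked logit vector, all four scores exceed those of a near-uniform one, which is the separation behavior Definition 6 asks for.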
+
+ 2.2.3 Evaluation Metrics
+
+ In an ideal world, the evaluation metric of interest would be the same as the loss
+ function used for training, yet this is rarely the case in practice, as the
+ gradient-based optimization process requires a continuously differentiable function,
+ while the metric of interest is often non-differentiable, e.g., accuracy vs.
+ cross-entropy in classification.
+ Throughout our works, we have used (or extended) multiple predictive
+ performance, calibration, and robustness metrics, the most interesting of which
+ are outlined below.
+ Average Normalized Levenshtein Similarity (ANLS) is a metric introduced
+ in [39] for the evaluation of VQA, which was then extended [449] to
+ support lists and be invariant to the order of provided answers. We adapted the
+ underlying Levenshtein Distance (LD) metric [251] to support not-answerable
+ questions, NA(G) = I[type(G) = not-answerable] (see Equation (2.7)).
+
assets/txts/pg_0057.txt ADDED
@@ -0,0 +1,98 @@
+ Consider, for simplicity, the evaluation of a single non-list ground truth answer
+ G and prediction P̂, with string lengths |G| and |P̂|, respectively.
+
+ LD(G, P̂) =
+   1                          if NA(G) ∧ |P̂| > 0,
+   0                          if NA(G) ∧ |P̂| = 0,
+   |G|                        if |P̂| = 0,
+   LD(tail(G), tail(P̂))      if G[0] = P̂[0],
+   1 + min of:
+     LD(tail(G), P̂)          if G[0] ≠ P̂[0] (deletion),
+     LD(G, tail(P̂))          if G[0] ≠ P̂[0] (insertion),
+     LD(tail(G), tail(P̂))    if G[0] ≠ P̂[0] (substitution)
+                                                             (2.7)
+
+ Each of the conditions is tested in turn, and the first one that is true is executed.
+ The normalized similarity metric is then defined as
+
+ NLS(G, P̂) = 1 − LD(G, P̂) / max(1, |G|, |P̂|).
+
+ Given multiple ground truth answer variants G_i = {a_1, a_2, ...} and a predicted
+ answer P̂_{Q_i} for each question Q_i in the test set of size N, we define the
+ complete metric as follows:
+
+ ANLS = (1/N) Σ_{i=1}^{N} max_{a∈G_i} s(a, P̂_{Q_i})                    (2.8)
+
+ s(a, P̂_{Q_i}) = NLS(a, P̂_{Q_i})   if NLS(a, P̂_{Q_i}) ≥ τ,
+                 0                  if NLS(a, P̂_{Q_i}) < τ,            (2.9)
+
+ where we follow prior literature [39, 449] in setting the threshold τ = 0.5.
+ In the case of a list-type question, Hungarian matching is performed following
+ [449] according to the NLS between each ground truth answer part and each
+ prediction answer part.
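A compact sketch of Equations (2.7)–(2.9) (illustrative only; the function names are ours, and the symmetric base case for an exhausted ground truth string is added for completeness):

```python
from functools import lru_cache

def levenshtein_na(g: str, p: str, not_answerable: bool = False) -> int:
    """Edit distance in the spirit of Equation (2.7), with the not-answerable
    (NA) cases handled first: an NA question scores 0 iff the prediction is empty."""
    if not_answerable:
        return 1 if p else 0

    @lru_cache(maxsize=None)
    def ld(i: int, j: int) -> int:
        if j == len(p):          # prediction consumed: delete the rest of g
            return len(g) - i
        if i == len(g):          # ground truth consumed: insert the rest of p
            return len(p) - j
        if g[i] == p[j]:         # first characters match
            return ld(i + 1, j + 1)
        return 1 + min(ld(i + 1, j),        # deletion
                       ld(i, j + 1),        # insertion
                       ld(i + 1, j + 1))    # substitution

    return ld(0, 0)

def nls(g: str, p: str) -> float:
    """Normalized Levenshtein similarity."""
    return 1.0 - levenshtein_na(g, p) / max(1, len(g), len(p))

def anls(answers_per_q, preds, tau: float = 0.5) -> float:
    """ANLS over a test set (Equations (2.8)-(2.9)): best NLS over the answer
    variants of each question, zeroed below the threshold tau."""
    scores = []
    for variants, p in zip(answers_per_q, preds):
        best = max(nls(a, p) for a in variants)
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / len(scores)
```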
+ Proper scoring rules [330] are used for generic evaluation of predictive
+ performance; they calculate scores at the instance level while measuring both
+ the quality of the predictive function and the predicted probability distribution
+ (as they are not compatible with an arbitrary CSF):
+ • Negative Log-Likelihood (NLL) [378] is both a popular loss function
+ (cross-entropy) and a scoring rule which only penalizes the (wrong) log
+ probabilities given to the true class, with I an indicator function defining
+
assets/txts/pg_0058.txt ADDED
@@ -0,0 +1,62 @@
+ the true class. This measure more heavily penalizes sharp probabilities
+ that are close to the wrong edge or class through over-/under-confidence.
+
+ ℓ_NLL(f) = − (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} I[y_i = k] · log(f_k(x_i))          (2.10)
+
+ • Brier Score [50] is a scoring rule that measures the accuracy of a
+ probabilistic classifier and is related to the mean-squared error (MSE) loss
+ function. The Brier score is more commonly used in industrial practice since it
+ is an ℓ2 metric (score between 0 and 1), yet it penalizes tail probabilities
+ less severely than NLL.
+
+ ℓ_BS(f) = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} (I[y_i = k] − f_k(x_i))²               (2.11)
+
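The two scoring rules in Equations (2.10) and (2.11) can be transcribed directly (a minimal sketch; `probs` is assumed to be a list of row-stochastic vectors):

```python
import math

def nll(probs, labels):
    """Equation (2.10): mean negative log-probability assigned to the true class."""
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n

def brier(probs, labels):
    """Equation (2.11): mean squared error between the predicted probability
    vector and the one-hot encoding of the true class."""
    n, k = len(labels), len(probs[0])
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((float(j == y) - p[j]) ** 2 for j in range(k))
    return total / n
```

A perfect one-hot prediction drives both scores to zero, while NLL diverges as the probability on the true class approaches zero, illustrating its harsher tail penalty.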
+ All of the following metrics require a CSF g(x) to be defined, and can pertain to
+ specific evaluation settings [389] tested in Section 3.4.5.
+ Expected Calibration Error (ECE) [156, 332] is a default metric to evaluate
+ top-1 prediction miscalibration. A calibration estimator (Definition 7) measures
+ the Lp norm difference between a model’s posterior and the true likelihood of
+ being correct.
+ Definition 7 (Lp Calibration Error). [231, 463]
+ The Lp calibration error of f : X → ∆Y over the joint distribution (X × Y)
+ with the Lp norm p ∈ [1, ∞) is given by:
+
+ CE_p(f)^p = E_{(X,Y)} [ ‖E[Y | f(X)] − f(X)‖_p^p ]                    (2.12)
+
+ The popular ECE metric [332] with condition I[Y = ŷ] is a special case of the
+ above with p = 1, where the expectation is approximated using a histogram.
+ MaxCE defines the worst-case risk version with p = ∞, effectively reporting on
+ the bin with the highest error. As part of Chapter 5, we contributed a novel
+ empirical estimator of top-1 calibration for the task of VQA, where the exact
+ accuracy condition I[Y = ŷ] in ECE is replaced by I[ANLS(y, ŷ) > τ]. Prior
+ work [329] used a similar strategy of thresholding continuous quality scores to
+ be able to estimate ECE.
+ In practice, ECE is implemented as a histogram binning estimator that
+ discretizes predicted probabilities into ranges of possible values for which
+ the conditional expectation can be estimated. Concretely, the probability space
+ is partitioned into B bins b_i with i ∈ {1, ..., B}, where for each bin b_i the gap
+ between observed accuracy and bin confidence P̄_b is measured, with a final
+
assets/txts/pg_0059.txt ADDED
@@ -0,0 +1,64 @@
+ average weighted by the number of samples per bin |b_i|.
+
+ ECE = Σ_{i=1}^{B} (|b_i| / N) · |acc(b_i) − P̄_b(b_i)|                 (2.13)
+
+ To minimize the drawbacks inherited from histogram binning, as suggested
+ by the literature [231, 342, 393, 463], we have applied an equal-mass binning
+ scheme with 100 bins (close to √N). While plenty of histogram-based ECE
+ estimator implementations exist, many design hyperparameters are not reported
+ or exposed:
+ I. ℓp norm
+ II. The number of bins (beyond the unfounded default of |B| = 15)
+ III. Different binning schemes (equal-range, equal-mass)
+ IV. Binning range to define the operating zone
+ V. Proxy used as bin accuracy (lower-edge, center, upper-edge)
+
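A sketch of the binned estimator in Equation (2.13) under the equal-mass scheme described above (the bin count is exposed as a parameter; this is an illustration, not the upstreamed implementation):

```python
def ece_equal_mass(confidences, correct, num_bins=10):
    """Binned top-1 ECE (Equation (2.13)) with an equal-mass binning scheme:
    instances are sorted by confidence and split into contiguous bins of
    (near-)equal sample count; bin confidence is the mean confidence."""
    n = len(confidences)
    order = sorted(range(n), key=lambda i: confidences[i])
    edges = [round(j * n / num_bins) for j in range(num_bins + 1)]
    ece = 0.0
    for lo, hi in zip(edges, edges[1:]):
        idx = order[lo:hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)          # observed accuracy
        conf = sum(confidences[i] for i in idx) / len(idx)     # mean bin confidence
        ece += len(idx) / n * abs(acc - conf)                  # weighted L1 gap
    return ece
```

A model that is 80% accurate at confidence 0.8 yields an ECE of zero, while a fully confident model that is only half correct yields 0.5, matching the intuition behind Equation (2.13).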
+ We upstreamed¹ a generic implementation of binning-based ECE as part of
+ the ICDAR 2023 DUDE competition (Chapter 5).
+ Alternative formulations have been developed for multi-class [342, 370, 492]
+ and multi-label calibration [493, 520]. Measurements of “strong” calibration,
+ over the full predicted vector instead of the winning class, are reported less in
+ practice. Possible reasons are that they render class-wise scorings, either based
+ on adaptive thresholds or requiring estimation of a kernel-based calibration error
+ to derive hypothesis tests. While we are mindful of alternatives (revisited in
+ Section 2.2.4), we have found that the simpler “weak” calibration measured by
+ ECE meets the practical requirements for most of our benchmarking.
+ Area-Under-Risk-Coverage-Curve (AURC) [138, 193] measures the possible
+ trade-offs between coverage (proportion of the test set, %) and risk (error %
+ under a given coverage). The metric explicitly assesses i.i.d. failure detection
+ performance as desired for safe deployment. It has advantages as a primary
+ evaluation metric given that it is effective both when the underlying prediction
+ models are the same or different (as opposed to AUROC or AUPR). Its most
+ general form (without any curve approximation), with a task-specific evaluation
+ metric ℓ and CSF g, is defined as:
+
+ AURC(f, g) = E_{x∼P_X} [ E_{(x̃,ỹ)∼P_XY}[ ℓ(f(x̃), ỹ) · I[g(x̃) > g(x)] ] / E_{x̃∼P_X}[ I[g(x̃) > g(x)] ] ]    (2.14)
+
+ This captures the intuition that the CSF g should be able to rank instances by
+ their risk, and that the risk should be low for instances with high confidence.
+
+ ¹ https://huggingface.co/spaces/jordyvl/ece
assets/txts/pg_0060.txt ADDED
@@ -0,0 +1,53 @@
+ The standard curve metric can be obtained by sorting all CSF estimates and
+ evaluating risk (FP / (TP + FP)) and coverage ((TP + FP) / (TP + FP + FN + TN))
+ for each threshold t (P if above the threshold) from high to low, together with
+ their respective correctness (T if correct). This is normally based on exact match,
+ yet for generative evaluation in Section 5.3.5, we have applied ANLS thresholding
+ instead. Formulated this way, the best possible AURC is constrained by the
+ model’s test error (1 − ANLS) and the number of test instances. AURC might
+ be more sensible for evaluating in a high-accuracy regime (e.g., 95% accuracy),
+ where risk can be better controlled and error tolerance is an a priori system-level
+ decision [115]. This metric was used in every chapter of Part II.
+ classification metrics following [172], Area Under the Receiver Operating
20
+ Characteristic Curve (AUROC) and Area Under the Precision-Recall
21
+ Curve (AUPR), which are threshold-independent measures that summarize
22
+ detection statistics of positive (out-of-distribution) versus negative (indistribution) instances. In this setting, AUROC corresponds to the probability
23
+ that a randomly chosen out-of-distribution sample is assigned a higher confidence
24
+ score than a randomly chosen in-distribution sample. AUPR is more informative
25
+ under class imbalance.
26
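The rank-based interpretation of AUROC stated above can be computed directly for small sample sets (an illustrative check, not an efficient estimator; ties count half):

```python
def auroc(scores_ood, scores_id):
    """AUROC via its rank interpretation: the probability that a random
    out-of-distribution (positive) sample receives a higher detection score
    than a random in-distribution (negative) sample, with ties counted 0.5."""
    wins = sum((o > i) + 0.5 * (o == i) for o in scores_ood for i in scores_id)
    return wins / (len(scores_ood) * len(scores_id))
```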
+
+ 2.2.4 Calibration
+
+ The study of calibration originated in the meteorology and statistics literature,
+ primarily in the context of proper loss functions [330] for evaluating
+ probabilistic forecasts. Calibration promises i) interpretability, ii) system
+ integration, iii) active learning, and iv) improved accuracy. A calibrated model,
+ as defined in Definition 8, can be interpreted as a probabilistic model, can be
+ integrated into a larger system, and can guide active learning with potentially
+ fewer samples. Research into calibration regained popularity after repeated
+ empirical observations of overconfidence in DNNs [156, 339].
+ Definition 8 (Perfect calibration). [86, 88, 520] Calibration is a property of
+ an empirical predictor f, which states that on finite-sample data it converges
+ to a solution where the confidence scoring function reflects the probability ρ of
+ being correct. Perfect calibration, CE(f) = 0, is satisfied iff:
+
+ P(Y = Ŷ | f(X) = ρ) = ρ,  ∀ρ ∈ [0, 1]                                 (2.15)
+
+ Below, we characterize calibration research in two directions: (A) CSF evaluation
+ with both theoretical guarantees and practical estimation methodologies:
+ • Estimators for calibration notions beyond top-1 [229, 231, 342, 463]
+
assets/txts/pg_0061.txt ADDED
@@ -0,0 +1,42 @@
+ • Theoretical frameworks to generalize over existing metrics and design
+ novel metrics [43, 231, 492, 493]
+ • Specialization towards a task such as multi-class classification [463],
+ regression [228, 428], or structured prediction [227]
+ • Alternative error estimation procedures, based on histogram regression
+ [156, 331, 332, 340, 343], kernels [230, 370, 492, 493] or splines [159]
+ (B) Calibration methods for improving the reliability of a model by adapting
+ the CSF or inducing calibration during training of f:
+ • Learn a post-hoc forecaster F : f(X) → [0, 1] on top of f (overview: [298])
+ • Modify the training procedure with regularization (overview: [277, 370])
+ Due to its importance in practice, we will provide more detail on train-time
+ calibration methods. It has been shown for a broad class of loss functions
+ that risk minimization leads to Fisher-consistent, Bayes-optimal classifiers in
+ the asymptotic limit [25, 495]. These can be shown to decompose into a sum
+ of multiple metrics including both accuracy and calibration error [144, 177].
+ However, there is no guarantee –on finite data, nor asymptotically– that classifiers
+ trained with proper loss functions containing an explicit calibration term
+ will eventually be well-calibrated. In practice, being entangled with other
+ optimization terms often leads to sub-optimal calibration. For this reason,
+ recent studies [12, 230, 492] have derived trainable estimators of calibration
+ to have a better handle (γ > 0) on penalizing miscalibration, i.e., by jointly
+ optimizing risk (R(f) = E_{X,Y}[ℓ(Y, f(X))]) and a parameterized calibration
+ error (CE) as in Equation (2.16).
+
+ f̂ = arg min_{f∈F} ( R(f) + γ CE(f) )                                  (2.16)
+
+ Many of these methods are implicitly or explicitly maximizing the entropy of
+ predictions or the entropy relative to another probability distribution, e.g.,
+ Entropy Regularization [361], Label Smoothing (LS) [327], Focal Loss [324],
+ Margin-based LS [277], next to more direct (differentiable), kernel-based calibration
+ error estimation [211, 230, 370, 492, 493, 526]. We had expected community
+ contributions to the DUDE competition (Chapter 5) to take advantage of this
+ wealth of calibration methods, yet the majority of submissions used uncalibrated
+ models with MSP, requiring more education on the importance of calibration
+ in practice.
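As a plug-in illustration of the two terms in Equation (2.16), the sketch below evaluates R(f) + γ · CE(f) for a fixed predictor (names are ours; the trainable estimators in [12, 230, 492] replace this non-differentiable binned CE term with differentiable, e.g., kernel-based, surrogates):

```python
import math

def nll_risk(probs, labels):
    """Empirical risk R(f) with the log loss (cross-entropy)."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def binned_ce(probs, labels, num_bins=10):
    """A crude plug-in estimate of top-1 calibration error CE(f),
    via equal-range binning of the maximum predicted probability."""
    n = len(labels)
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        k = min(int(conf * num_bins), num_bins - 1)
        bins[k].append((conf, float(p.index(max(p)) == y)))
    ce = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)   # mean bin confidence
            acc = sum(a for _, a in b) / len(b)    # observed bin accuracy
            ce += len(b) / n * abs(acc - conf)     # weighted L1 gap
    return ce

def regularized_objective(probs, labels, gamma=1.0):
    """Equation (2.16): the joint objective R(f) + gamma * CE(f),
    evaluated for a fixed set of predictions."""
    return nll_risk(probs, labels) + gamma * binned_ce(probs, labels)
```

Since CE(f) ≥ 0, the penalized objective can only exceed the plain risk; γ trades off predictive fit against the calibration gap.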