I've been working through the first two lessons of
[the fastai course](https://course.fast.ai/). For lesson one I trained a model
to recognise my cat, Mr Blupus. For lesson two the emphasis is on getting those
models out in the world as some kind of demo or application.
[Gradio](https://gradio.app) and
[Huggingface Spaces](https://huggingface.co/spaces) make it super easy to get a
prototype of your model on the internet.
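
The shape of such a Gradio wrapper can be sketched in a few lines. Everything here is illustrative: `render_pages` and `classify_page` are hypothetical stand-ins for the real PDF-rendering and fastai inference code, and are not defined in this sketch.

```python
# Illustrative sketch of a Gradio wrapper around the page classifier.
# `render_pages` and `classify_page` are hypothetical stand-ins and are
# not defined here.

def format_page_report(redacted_pages):
    """Build the alert text listing which pages were flagged as redacted."""
    if not redacted_pages:
        return "No redacted pages detected."
    return "Redacted pages: " + ", ".join(str(p) for p in redacted_pages)

def classify_pdf(pdf_file):
    # Render each page to an image and run the classifier over it.
    redacted = [
        i
        for i, page in enumerate(render_pages(pdf_file), start=1)  # hypothetical
        if classify_page(page) == "redacted"  # hypothetical
    ]
    return format_page_report(redacted)

def build_demo():
    import gradio as gr  # imported lazily so the sketch loads without Gradio
    return gr.Interface(fn=classify_pdf, inputs=gr.File(label="PDF"), outputs="text")

# build_demo().launch() starts the web UI locally; pushing the same script
# to a Huggingface Space serves it publicly.
```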

This MVP app runs two models to mimic what a final deployed version of the
project might look like.

- The first model (a classification model trained with fastai, available on the
  Huggingface Hub
  [here](https://huggingface.co/strickvl/redaction-classifier-fastai) and
  testable as a standalone demo
  [here](https://huggingface.co/spaces/strickvl/fastai_redaction_classifier)),
  determines which pages of the PDF are redacted. I've written
  about how I trained this model [here](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).
- The second model (an object detection model trained using [IceVision](https://airctic.com/),
  itself built partly on top of fastai) detects which parts of the image are redacted.
  This is a model I've been working on for a while and I described my process in
  a series of blog posts (see below).

The app does several things:

- it extracts any pages it considers to contain redactions and displays that
  subset as an [image carousel](https://gradio.app/docs/#o_carousel). It also
  displays some text alerting you to which specific pages were redacted.
- if you click the "Analyse and extract redacted images" checkbox, it will:
  - pass the pages it considered redacted through the object detection model
  - calculate what proportion of the total image area was redacted, as well as
    what proportion of the actual content area (i.e. excluding margins and
    other space containing no content)
  - create a PDF that you can download that contains only the redacted images,
    with an overlay of the redactions that it was able to identify along with
    the confidence score for each item.
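
The area calculation in the second step boils down to plain geometry over the detector's bounding boxes. A minimal sketch, assuming the detected boxes don't overlap (overlaps would need merging first) and that a `content_box` bounding the non-margin content is already known:

```python
def box_area(box):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def redacted_proportions(boxes, page_size, content_box):
    """Proportion of the page, and of the content area, covered by redactions.

    Assumes the detected boxes do not overlap; overlapping detections
    would need to be merged before summing their areas.
    """
    page_w, page_h = page_size
    redacted = sum(box_area(b) for b in boxes)
    return redacted / (page_w * page_h), redacted / box_area(content_box)
```

For example, two 10x10 boxes on a 100x100 page with a 50x50 content box cover 2% of the page but 8% of the content.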

## The Dataset

I downloaded a few thousand publicly available FOIA documents from a government
website. I split the PDFs up into individual `.jpg` files and then used
[Prodigy](https://prodi.gy/) to annotate the data. (This process was described
in
[a blogpost written last
year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
For the object detection model, the process was quite a bit more involved, and I
refer you to the series of articles in the 'Further Reading' section below.
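
The PDF-splitting step can be sketched with a library such as `pdf2image` (a wrapper around poppler's `pdftoppm`); this is one option for the job, and the file-naming scheme below is illustrative rather than the exact one used.

```python
from pathlib import Path

def page_filename(pdf_path, page_number):
    """Illustrative naming scheme for exported pages, e.g. 'foia_page_001.jpg'."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.jpg"

def pdf_to_jpgs(pdf_path, out_dir):
    """Render every page of a PDF to a .jpg file and return the output paths."""
    # pdf2image wraps poppler's pdftoppm; imported lazily so this sketch
    # loads even where the dependency is missing.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, page in enumerate(convert_from_path(pdf_path), start=1):
        dest = out_dir / page_filename(pdf_path, i)
        page.save(dest, "JPEG")
        paths.append(dest)
    return paths
```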

## Training the model

I trained the classification model with fastai's flexible `vision_learner`,
fine-tuning `resnet18`, which was both smaller than `resnet34` (no surprises
there) and less prone to early overfitting. I trained the model for 10 epochs.
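
In fastai this amounts to only a few lines. A hedged sketch, assuming the labelled page images sit in per-class folders under `data/` (the folder layout, validation split, and transforms here are assumptions, not the exact setup used):

```python
EPOCHS = 10  # matches the 10 epochs described above

def train_classifier(data_path="data"):
    """Fine-tune resnet18 on a folder-per-class image dataset.

    Assumed layout: data/redacted/*.jpg and data/unredacted/*.jpg.
    """
    # Imported lazily so this sketch loads without fastai installed.
    from fastai.vision.all import (
        ImageDataLoaders, Resize, accuracy, resnet18, vision_learner,
    )

    dls = ImageDataLoaders.from_folder(
        data_path, valid_pct=0.2, item_tfms=Resize(224)
    )
    learn = vision_learner(dls, resnet18, metrics=accuracy)
    learn.fine_tune(EPOCHS)
    return learn
```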

I trained the object detection model with IceVision, using VFNet as the model
and `resnet50` as the backbone. After 50 epochs of training it reached 89%
accuracy on the validation data.

## Further Reading

This initial dataset spurred an ongoing interest in the domain and I've since
been working on the problem of object detection, i.e. identifying exactly which
parts of the image contain redactions.

Some of the key blog posts I've written about this project:

- How to annotate data for an object detection problem with Prodigy
  ([link](https://mlops.systems/redactionmodel/computervision/datalabelling/2021/11/29/prodigy-object-detection-training.html))
- How to create synthetic images to supplement a small dataset
  ([link](https://mlops.systems/redactionmodel/computervision/python/tools/2022/02/10/synthetic-image-data.html))
- How to use error analysis and visual tools like FiftyOne to improve model
  performance
  ([link](https://mlops.systems/redactionmodel/computervision/tools/debugging/jupyter/2022/03/12/fiftyone-computervision.html))
- Creating more synthetic data focused on the tasks my model finds hard
  ([link](https://mlops.systems/tools/redactionmodel/computervision/2022/04/06/synthetic-data-results.html))
- Data validation for object detection / computer vision (a three-part series:
  [part 1](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/19/data-validation-great-expectations-part-1.html),
  [part 2](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/26/data-validation-great-expectations-part-2.html),
  [part 3](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/28/data-validation-great-expectations-part-3.html))