kdebie-nfi committed on
Commit
6af8fad
1 Parent(s): 3740ba6

Update README.md

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -23,7 +23,7 @@ Firstly, we do not have photos of snippets available for all classes, but only f
  Secondly, the set of classes is dynamic and changes quickly over time (as more fireworks are added), and it is not feasible to train a new model each time.
  Therefore, we require a one-shot model, which we construct as follows.
 
- ### Embedding model
+ ### _Embedding model_
 
  First, we train an embedder that produces similar embeddings for snippets and wrappers of the same category, and dissimilar embeddings for different categories.
  The embedding model is based on the Vision Transformer architecture (see [arXiv](https://arxiv.org/abs/2010.11929)).
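At inference time, such an embedder simply maps an image (snippet photo or wrapper) to a vector. The sketch below illustrates this with a ViT backbone; the checkpoint name, the use of the [CLS] token, and the L2 normalization are assumptions for illustration and are not specified in this card.

```python
# Minimal sketch: embed a snippet photo with a ViT backbone.
# The checkpoint, CLS-token pooling and L2 normalization are assumptions.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

CHECKPOINT = "google/vit-base-patch16-224-in21k"  # placeholder backbone

processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
model = ViTModel.from_pretrained(CHECKPOINT)
model.eval()

image = Image.open("snippet_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token as the image embedding and L2-normalize it so that
# distances between embeddings stay in a bounded range.
embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], dim=-1)
```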
@@ -36,14 +36,14 @@ It has the following specifications:
  * Fixed learning rate of 0.000015 with Adam optimizer
  * Epochs: 100
 
- ### Classification
+ ### _Classification_
 
  To be able to link a photo of snippets to a firework category, we construct reference images based on the wrappers in each category (see “data” for a description), which we convert into reference embeddings using the trained embedding model.
  In the same way, we create an embedding for the snippet photo.
  To produce a classification, we calculate the L2 distance (normalized to between 0 and 1) between the snippet photo embedding and the reference embeddings of each of the categories.
  The minimum distance across all reference embeddings for each category is taken as the representative score for that category.
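A minimal sketch of this nearest-reference step is given below. The card does not specify how the L2 distance is normalized to the 0–1 range, so the division by 2 (which is valid for unit-norm embeddings) and the variable names are assumptions.

```python
# Minimal sketch of the nearest-reference classification step, assuming
# L2-normalized embeddings (see the embedding sketch above).
import torch

def classify(snippet_emb: torch.Tensor,
             reference_embs: dict[str, torch.Tensor]) -> dict[str, float]:
    """Return one distance score per category (lower = more similar).

    snippet_emb:    shape (d,), embedding of the snippet photo.
    reference_embs: category -> tensor of shape (n_refs, d) holding the
                    reference embeddings built from that category's wrappers.
    """
    scores = {}
    for category, refs in reference_embs.items():
        # L2 distance between the snippet embedding and every reference image.
        dists = torch.cdist(snippet_emb.unsqueeze(0), refs).squeeze(0)
        # For unit-norm embeddings the distance lies in [0, 2]; dividing by 2
        # is one possible way to map it to [0, 1] (an assumption).
        dists = dists / 2.0
        # The minimum distance over the references represents the category.
        scores[category] = dists.min().item()
    return scores

# Categories can then be ranked by ascending score:
# ranked = sorted(scores, key=scores.get)
```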
 
- #### Text filter
+ #### _Text filter_
 
  Optionally, a text filter is applied on top of the classification; it filters the fireworks labels based on the text that is on the snippet.
  The text on the snippets must be manually entered.
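The card does not state how the manually entered text is matched against the candidate labels. The sketch below assumes a simple case-insensitive substring match against known wrapper texts; both the matching rule and the `wrapper_texts` lookup are assumptions.

```python
# Minimal sketch of the optional text filter: keep only the categories whose
# known wrapper text contains the manually entered snippet text. The matching
# rule and the wrapper_texts mapping are assumptions for illustration.
def text_filter(scores: dict[str, float],
                snippet_text: str,
                wrapper_texts: dict[str, str]) -> dict[str, float]:
    query = snippet_text.strip().lower()
    if not query:
        return scores  # nothing entered, nothing filtered
    return {
        category: score
        for category, score in scores.items()
        if query in wrapper_texts.get(category, "").lower()
    }
```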
@@ -55,7 +55,7 @@ The dataset on which the model is trained and evaluated is constructed from fire
  The final model (provided here) is trained on all data.
  For evaluation purposes, we split the data into a train and a test set, which we describe under “evaluation”.
 
- ### Lab snippets
+ ### _Lab snippets_
 
  For 38 categories of fireworks, we have created snippets of their wrappers by exploding these fireworks.
  These snippets are photographed with a high-quality DSLR camera on a white background, directly from above, with good lighting conditions (hence ‘lab’ snippets).
@@ -63,14 +63,14 @@ The snippets are then segmented, after which samples are taken of between 1 and
  We take 35 samples for each N, so 35 times 1 snippet, 35 times 2 snippets, ... leading to a total of 350 snippet photos per category.
  In total, the set of lab snippets then consists of 350 * 38 = 13,300 images.
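These counts follow from the sampling scheme: 35 samples for each N, with N presumably running from 1 to 10 (since 35 × 10 = 350 per category). A minimal sketch of that sampling loop is shown below; how the selected snippets are composed into a photo is omitted, and the function name is an assumption.

```python
# Minimal sketch of the lab-snippet sampling scheme: 35 random subsets of
# N snippets for N = 1..10, i.e. 350 samples per category and
# 350 * 38 = 13,300 in total. Composing a subset into a photo is omitted.
import random

def sample_snippet_subsets(segmented_snippets, samples_per_n=35, max_n=10):
    """Return 350 random subsets of one category's segmented snippets."""
    subsets = []
    for n in range(1, max_n + 1):
        for _ in range(samples_per_n):
            k = min(n, len(segmented_snippets))
            subsets.append(random.sample(segmented_snippets, k))
    return subsets
```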
 
- ### Mock-crime scene snippets
+ ### _Mock-crime scene snippets_
 
  For some of the categories, we have created photos that more closely resemble those taken at a crime scene.
  As we expect the model to work better when there is less background noise in the image, we have created photos that we believe are reasonably comparable to what may be done in crime scene circumstances.
  Therefore, we have mostly created images where snippets are laid out on so-called 'DNA blankets', which may be green or blue in appearance but at least produce a somewhat plain background.
  In total, we have 2489 such photos available, from 7 different categories of fireworks.
 
- ### Artificial snippets
+ ### _Artificial snippets_
 
  As the embedding model must produce embeddings for all fireworks categories, and we do not have snippets available for every category, we also create ‘artificial snippets’ by taking random crops from each fireworks wrapper.
  These artificial snippet photos consist of between 1 and 10 snippets, and we construct 35 images for each wrapper.
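A minimal sketch of how such artificial snippet photos could be generated is given below. The crop sizes, the plain white canvas, and the function names are assumptions; the card only states that random crops are taken from each wrapper and that each photo contains between 1 and 10 of them.

```python
# Minimal sketch: build 'artificial snippet' photos from random wrapper crops.
# Crop sizes, canvas size and the white background are assumptions.
import random
from PIL import Image

def random_crop(wrapper, min_frac=0.1, max_frac=0.3):
    """Take a random rectangular crop covering a small fraction of the wrapper."""
    w, h = wrapper.size
    cw = int(w * random.uniform(min_frac, max_frac))
    ch = int(h * random.uniform(min_frac, max_frac))
    x = random.randint(0, w - cw)
    y = random.randint(0, h - ch)
    return wrapper.crop((x, y, x + cw, y + ch))

def artificial_snippet_photos(wrapper, n_images=35, canvas_size=(1024, 1024)):
    """Create 35 artificial photos per wrapper, each with 1 to 10 random crops."""
    photos = []
    for _ in range(n_images):
        canvas = Image.new("RGB", canvas_size, "white")
        for _ in range(random.randint(1, 10)):
            crop = random_crop(wrapper)
            x = random.randint(0, max(0, canvas_size[0] - crop.width))
            y = random.randint(0, max(0, canvas_size[1] - crop.height))
            canvas.paste(crop, (x, y))
        photos.append(canvas)
    return photos
```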
@@ -89,7 +89,7 @@ As the mock-crime scene dataset only consists of 7 classes, we are unable to con
  In practice, a drop in performance may of course be expected in the worst-case scenario for (mock-)crime scene snippets.
  Overall, we find that the model performs very well for classes that are present in the train set, and that the text filter gives a significant boost if this is not the case.
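The tables below report Accuracy @ k. The card does not define this metric explicitly; the sketch below assumes the standard reading, i.e. the fraction of snippet photos for which the true category appears among the k lowest-distance categories.

```python
# Minimal sketch of Accuracy @ k, assuming the standard definition: the
# fraction of photos whose true category is among the k best-scoring
# (lowest-distance) categories.
def accuracy_at_k(labelled_scores, k):
    """labelled_scores: list of (true_category, {category: distance}) pairs."""
    hits = 0
    for true_category, scores in labelled_scores:
        top_k = sorted(scores, key=scores.get)[:k]  # k smallest distances
        hits += true_category in top_k
    return hits / len(labelled_scores)
```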
 
- ### Lab snippets
+ ### _Lab snippets_
 
 
  | Metric | Worst-case (without text filter) | Best-case (without text filter) | Worst-case (with text filter) | Best-case (with text filter) |
@@ -100,7 +100,7 @@ Overall, we find that the model performs very well for classes that are present
  | Accuracy @ 10| 0.48 | 1.00 | 0.84 | 1.00 |
  | Accuracy @ 25| 0.75 | 1.00 | 0.90 | 1.00 |
 
- ### Mock snippets
+ ### _Mock snippets_
 
  | Metric | Without text filter | With text filter |
  |--------------|------------------------|------------------------|
@@ -112,7 +112,7 @@ Overall, we find that the model performs very well for classes that are present
 
  Note that the final model is trained on all data, so we expect performance to increase somewhat compared to these metrics.
 
- ### Limitations
+ ### _Limitations_
 
  The evaluation results described above may not be representative of real-world performance of the model for several reasons.
  Firstly, the model was only trained and evaluated on photos of snippets with relatively plain backgrounds and relatively good lighting conditions.
 