kdebie-nfi committed "Update README.md" (commit 6af8fad, parent 3740ba6).

Firstly, we do not have photos of snippets available for all classes, but only for some of them.
Secondly, the set of classes is dynamic and changes quickly over time (as more fireworks are added), and it is not feasible to train a new model each time.
Therefore, we require a one-shot model, which we construct as follows.

### _Embedding model_

First, we train an embedder that produces similar embeddings for snippets and wrappers of the same category, and dissimilar embeddings for different categories.
The embedding model is based on the Vision Transformer architecture (see [arXiv](https://arxiv.org/abs/2010.11929)).

It has the following specifications:

* Fixed learning rate of 0.000015 with the Adam optimizer
* Epochs: 100

### _Classification_

To be able to link a photo of snippets to a firework category, we construct reference images based on the wrappers in each category (see “data” for a description), which we convert into reference embeddings using the trained embedding model.
In the same way, we create an embedding for the snippet photo.
To produce a classification, we calculate the L2 distance (normalized to between 0 and 1) between the snippet photo embedding and the reference embeddings of each of the categories.
The minimum distance across all reference embeddings for each category is taken as the representative score for that category.

#### _Text filter_

Optionally, a text filter is applied on top of the classification that filters the fireworks labels based on the text that is on the snippet.
The text on the snippets must be manually entered.
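
The matching rule of this filter is not described in this README; a hypothetical sketch, in which a category is kept only if the manually entered snippet text occurs in that category's known wrapper text, might look like:

```python
def apply_text_filter(ranked, snippet_text, wrapper_texts):
    """Keep only the categories whose known wrapper text contains the text
    manually read off the snippet (case-insensitive).  `ranked` is a list of
    (category, score) pairs; `wrapper_texts` maps category -> wrapper text.
    Hypothetical sketch, not the actual implementation."""
    needle = snippet_text.strip().lower()
    return [(cat, score) for cat, score in ranked
            if needle in wrapper_texts.get(cat, "").lower()]
```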

The dataset on which the model is trained and evaluated is constructed from fireworks.

The final model (provided here) is trained on all data.
For evaluation purposes, we split the data into a train and a test set, which we describe under “evaluation”.

### _Lab snippets_

For 38 categories of fireworks, we have created snippets of their wrappers by exploding these fireworks.
These snippets are photographed with a high-quality DSLR camera on a white background, directly from above, under good lighting conditions (hence ‘lab’ snippets).

The snippets are then segmented, after which samples are taken of between 1 and 10 snippets.

We take 35 samples for each N, so 35 times 1 snippet, 35 times 2 snippets, and so on, leading to a total of 350 snippet photos per category.
In total, the set of lab snippets consists of 350 * 38 = 13,300 images.
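
The sampling scheme above works out as follows:

```python
samples_per_n = 35             # photos for each snippet count N
snippet_counts = range(1, 11)  # N = 1..10 snippets per photo
categories = 38

photos_per_category = samples_per_n * len(snippet_counts)  # 35 * 10 = 350
total_lab_photos = photos_per_category * categories        # 350 * 38 = 13,300
```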

### _Mock-crime scene snippets_

For some of the categories, we have created photos that more closely resemble those taken at a crime scene.
As we expect the model to work better when there is less background noise in the image, we have created photos that we believe are reasonably comparable to what may be achievable under crime scene circumstances.
Therefore, we have mostly created images where snippets are laid out on so-called ‘DNA blankets’, which may be green or blue in appearance but at least provide a somewhat plain background.
In total, we have 2489 such photos available, from 7 different categories of fireworks.

### _Artificial snippets_

As the embedding model must produce embeddings for all fireworks categories, and we do not have snippets available for every category, we also create ‘artificial snippets’ by taking random crops from each fireworks wrapper.
These artificial snippet photos consist of between 1 and 10 snippets, and we construct 35 images for each wrapper.
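
A sketch of this crop-based construction, assuming wrapper images as numpy arrays; the crop size and the RNG are illustrative, not taken from the actual pipeline:

```python
import numpy as np

def random_crop(wrapper, crop_h, crop_w, rng):
    """One random crop (an 'artificial snippet') from an (H, W, 3) wrapper."""
    h, w = wrapper.shape[:2]
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    return wrapper[top:top + crop_h, left:left + crop_w]

def artificial_snippet_photo(wrapper, rng, max_snippets=10, crop_size=32):
    """Draw between 1 and `max_snippets` crops, as described above."""
    n = int(rng.integers(1, max_snippets + 1))
    return [random_crop(wrapper, crop_size, crop_size, rng) for _ in range(n)]
```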

As the mock-crime scene dataset only consists of 7 classes, we are unable to conduct a worst-case evaluation for it.

In practice, a drop in performance may of course be expected in the worst-case scenario for (mock-)crime scene snippets.
Overall, we find that the model performs very well for classes that are present in the train set, and that the text filter gives a significant boost if this is not the case.

### _Lab snippets_

| Metric | Worst-case (without text filter) | Best-case (without text filter) | Worst-case (with text filter) | Best-case (with text filter) |
|--------|----------------------------------|---------------------------------|-------------------------------|------------------------------|
| Accuracy @ 10 | 0.48 | 1.00 | 0.84 | 1.00 |
| Accuracy @ 25 | 0.75 | 1.00 | 0.90 | 1.00 |
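
Taking "Accuracy @ k" as top-k accuracy (the share of snippet photos whose true category appears among the k best-scoring, i.e. smallest-distance, categories), the metric can be sketched as:

```python
def accuracy_at_k(ranked_per_photo, true_labels, k):
    """ranked_per_photo: for each photo, category names ordered best-first.
    Returns the fraction of photos whose true category is in the top k."""
    hits = sum(true in ranked[:k]
               for ranked, true in zip(ranked_per_photo, true_labels))
    return hits / len(true_labels)
```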

### _Mock snippets_

| Metric | Without text filter | With text filter |
|--------|---------------------|------------------|

Note that the final model is trained on all data, so we expect its performance to be somewhat better than these metrics suggest.

### _Limitations_

The evaluation results described above may not be representative of the real-world performance of the model, for several reasons.
Firstly, the model was only trained and evaluated on photos of snippets with relatively plain backgrounds and relatively good lighting conditions.