kdebie-nfi committed "Update README.md" (commit 6af8fad, parent 3740ba6).

Firstly, we do not have photos of snippets available for all classes, but only for some of them.
Secondly, the set of classes is dynamic and changes quickly over time (as more fireworks are added), and it is not feasible to train a new model each time.
Therefore, we require a one-shot model, which we construct as follows.

### _Embedding model_

First, we train an embedder that produces similar embeddings for snippets and wrappers of the same category, and dissimilar embeddings for different categories.
The embedding model is based on the Vision Transformer architecture (see [arXiv](https://arxiv.org/abs/2010.11929)).

It has the following specifications:

* Fixed learning rate of 0.000015 with the Adam optimizer
* Epochs: 100

### _Classification_

To be able to link a photo of snippets to a firework category, we construct reference images based on the wrappers in each category (see “data” for a description), which we convert into reference embeddings using the trained embedding model.
In the same way, we create an embedding for the snippet photo.
To produce a classification, we calculate the L2 distance (normalized to between 0 and 1) between the snippet photo embedding and the reference embeddings of each of the categories.
The minimum distance across all reference embeddings for each category is taken as the representative score for that category.

#### _Text filter_

Optionally, a text filter is applied on top of the classification that filters the fireworks labels based on the text that is on the snippet.
The text on the snippets must be manually entered.
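
The matching rule of this filter is not described in this README; a hypothetical sketch, in which a category is kept only if the manually entered snippet text occurs in that category's known wrapper text, might look like:

```python
def apply_text_filter(ranked, snippet_text, wrapper_texts):
    """Keep only the categories whose known wrapper text contains the text
    manually read off the snippet (case-insensitive).  `ranked` is a list of
    (category, score) pairs; `wrapper_texts` maps category -> wrapper text.
    Hypothetical sketch, not the actual implementation."""
    needle = snippet_text.strip().lower()
    return [(cat, score) for cat, score in ranked
            if needle in wrapper_texts.get(cat, "").lower()]
```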

The dataset on which the model is trained and evaluated is constructed from fireworks.

The final model (provided here) is trained on all data.
For evaluation purposes, we split the data into a train and a test set, which we describe under “evaluation”.

### _Lab snippets_

For 38 categories of fireworks, we have created snippets of their wrappers by exploding these fireworks.
These snippets are photographed with a high-quality DSLR camera on a white background, directly from above, under good lighting conditions (hence ‘lab’ snippets).

The snippets are then segmented, after which samples are taken of between 1 and 10 snippets.

We take 35 samples for each N, so 35 times 1 snippet, 35 times 2 snippets, and so on, leading to a total of 350 snippet photos per category.
In total, the set of lab snippets consists of 350 * 38 = 13,300 images.
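
The sampling scheme above works out as follows:

```python
samples_per_n = 35             # photos for each snippet count N
snippet_counts = range(1, 11)  # N = 1..10 snippets per photo
categories = 38

photos_per_category = samples_per_n * len(snippet_counts)  # 35 * 10 = 350
total_lab_photos = photos_per_category * categories        # 350 * 38 = 13,300
```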

### _Mock-crime scene snippets_

For some of the categories, we have created photos that more closely resemble those taken at a crime scene.
As we expect the model to work better when there is less background noise in the image, we have created photos that we believe are reasonably comparable to what may be achievable under crime scene circumstances.
Therefore, we have mostly created images where snippets are laid out on so-called ‘DNA blankets’, which may be green or blue in appearance but at least provide a somewhat plain background.
In total, we have 2489 such photos available, from 7 different categories of fireworks.

### _Artificial snippets_

As the embedding model must produce embeddings for all fireworks categories, and we do not have snippets available for every category, we also create ‘artificial snippets’ by taking random crops from each fireworks wrapper.
These artificial snippet photos consist of between 1 and 10 snippets, and we construct 35 images for each wrapper.
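
A sketch of this crop-based construction, assuming wrapper images as numpy arrays; the crop size and the RNG are illustrative, not taken from the actual pipeline:

```python
import numpy as np

def random_crop(wrapper, crop_h, crop_w, rng):
    """One random crop (an 'artificial snippet') from an (H, W, 3) wrapper."""
    h, w = wrapper.shape[:2]
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    return wrapper[top:top + crop_h, left:left + crop_w]

def artificial_snippet_photo(wrapper, rng, max_snippets=10, crop_size=32):
    """Draw between 1 and `max_snippets` crops, as described above."""
    n = int(rng.integers(1, max_snippets + 1))
    return [random_crop(wrapper, crop_size, crop_size, rng) for _ in range(n)]
```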

As the mock-crime scene dataset only consists of 7 classes, we are unable to conduct a worst-case evaluation for it.

In practice, a drop in performance may of course be expected in the worst-case scenario for (mock-)crime scene snippets.
Overall, we find that the model performs very well for classes that are present in the train set, and that the text filter gives a significant boost if this is not the case.

### _Lab snippets_

| Metric | Worst-case (without text filter) | Best-case (without text filter) | Worst-case (with text filter) | Best-case (with text filter) |
|--------|----------------------------------|---------------------------------|-------------------------------|------------------------------|
| Accuracy @ 10 | 0.48 | 1.00 | 0.84 | 1.00 |
| Accuracy @ 25 | 0.75 | 1.00 | 0.90 | 1.00 |
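
Taking "Accuracy @ k" as top-k accuracy (the share of snippet photos whose true category appears among the k best-scoring, i.e. smallest-distance, categories), the metric can be sketched as:

```python
def accuracy_at_k(ranked_per_photo, true_labels, k):
    """ranked_per_photo: for each photo, category names ordered best-first.
    Returns the fraction of photos whose true category is in the top k."""
    hits = sum(true in ranked[:k]
               for ranked, true in zip(ranked_per_photo, true_labels))
    return hits / len(true_labels)
```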

### _Mock snippets_

| Metric | Without text filter | With text filter |
|--------|---------------------|------------------|

Note that the final model is trained on all data, so we expect its performance to be somewhat better than these metrics suggest.

### _Limitations_

The evaluation results described above may not be representative of the real-world performance of the model, for several reasons.
Firstly, the model was only trained and evaluated on photos of snippets with relatively plain backgrounds and relatively good lighting conditions.