Question regarding the ONNX version of the model.

#5
by KukaiN - opened

Hi, I'm trying to write some code to use the ONNX version of the swinv2-v2 tagger, and I'm running into trouble because the batch size is not dynamic. I have a NumPy array of shape (b, h, w, c) and want to use a batch size of 4 or 8.

When I load the ONNX model and inspect the input, the shape isn't dynamic ("N"); the batch dim is fixed at 1:
[name: "input_1:0"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: 1
      }
      dim {
        dim_value: 448
      }
      dim {
        dim_value: 448
      }
      dim {
        dim_value: 3
      }
    }
  }
}
]

I tried to load the ONNX model and convert it to a fixed batch size of 4, but it fails because a Reshape in the model is also hard-coded to the original batch size. Could you help modify the ONNX model to accommodate a different batch size, or would you be able to publish a newer version of the ONNX model with the batch size set to a dynamic value?

Edit: I know this was fixed for the v3 version, but I want to compare performance and see whether I should switch to v3, which is also why I'm asking.

Yes, you are correct. When I was in the process of releasing the models, TF was the preferred runtime for batch inference, and onnxruntime was targeted at fast, single-image CPU inference. Then they decided to remove GPU support on Windows, and I changed focus for the V3 series.

If it's really only a one-off, may I ask you to do the conversion on your end?
You can use the script by the JoyTag model author here: https://github.com/fpgaminer/joytag/blob/main/validation-arena/export_sw.py
I'm fairly sure the same script works for all the other models in the v2 series, too.

One more thing: if everything works out, I'm going to release a better checkpoint in a couple of days, so you may want to withhold judgement until then.

Thank you for the resources. I'll be looking forward to the new model (if it works out).

One thing has been bugging me: is there a particular reason the model's input uses 0~255 to represent the image? In tutorials and AI papers I usually see inputs normalized to 0~1.0 so the gradients don't skyrocket. (This is actually why my output was bad: I was feeding 0~1 ranged images. I only found out after looking at your image processing.)

The model expects the image as follows:

  • BGR - due to a legacy OpenCV pipeline; this channel order has infected ~6 years' worth of code I have written, and at this point I'm stuck with it;
  • -1..1 - because this range worked better with NFNets, and it made no difference otherwise;
    • The rescaling 0..255 -> -1..1 is embedded in the ONNX and TF graphs (where applicable) for convenience;
    • It has to be handled explicitly when using the new JAX or timm models;
  • RGBA images are converted to RGB by compositing over white, so images with a transparent background get a white background.
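Put together, that preprocessing might look like the sketch below (the function name is mine; for the ONNX/TF graphs the final 0..255 -> -1..1 rescale is already embedded in the model, so there you would stop before that step):

```python
# Sketch of the preprocessing described above; function name is mine.
# For the ONNX/TF graphs the final 0..255 -> -1..1 rescale is embedded in
# the model; the JAX/timm models need the full chain handled explicitly.
import numpy as np


def preprocess(rgba: np.ndarray) -> np.ndarray:
    """uint8 (h, w, 4) RGBA -> float32 (1, h, w, 3) BGR in -1..1, white background."""
    rgb = rgba[..., :3].astype(np.float32)
    alpha = rgba[..., 3:4].astype(np.float32) / 255.0
    rgb = rgb * alpha + 255.0 * (1.0 - alpha)  # composite over white
    bgr = rgb[..., ::-1]                       # RGB -> BGR channel order
    return (bgr / 127.5 - 1.0)[None, ...]      # rescale and add batch dim
```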

Please check out https://github.com/neggles/wdv3-timm for an example of what preprocessing looks like with v3 models and timm.

KukaiN changed discussion status to closed
