Hello, I have a big question

#1
by DucHaiten - opened

I see you created many different versions of v3. Are they actually any different, and how should I use each version for its own purpose?

I read a paper with a network architecture I like, implement said architecture, then train it from scratch.
The v3 models currently out are implementations of these papers:
ViT: https://arxiv.org/abs/2010.11929
ConvNext: https://arxiv.org/abs/2201.03545
SwinV2: https://arxiv.org/abs/2111.09883

They all do the same thing (take an image as input, give tag probabilities as output), but get there differently.
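
For illustration, here is a minimal sketch of how any of the three might be run for inference. The repo id, file names, input layout and preprocessing are assumptions and may not match the actual repositories exactly:

```python
# Minimal sketch: tagging an image with one of the v3 models via ONNX Runtime.
# The repo id, file names, input layout and preprocessing below are assumptions.
import numpy as np
import onnxruntime as ort
from PIL import Image
from huggingface_hub import hf_hub_download

REPO_ID = "SmilingWolf/wd-vit-tagger-v3"  # assumed id; swap in the ConvNext/SwinV2 variants

model_path = hf_hub_download(REPO_ID, "model.onnx")        # assumed file name
tags_path = hf_hub_download(REPO_ID, "selected_tags.csv")  # assumed file name

session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
_, height, width, _ = session.get_inputs()[0].shape  # NHWC input is an assumption

# Pad to a square on a white canvas, resize, and convert to BGR float32
# (a common preprocessing for these taggers; the real details may differ).
image = Image.open("example.png").convert("RGB")
side = max(image.size)
canvas = Image.new("RGB", (side, side), (255, 255, 255))
canvas.paste(image, ((side - image.width) // 2, (side - image.height) // 2))
canvas = canvas.resize((width, height), Image.BICUBIC)
inputs = np.ascontiguousarray(np.asarray(canvas, dtype=np.float32)[:, :, ::-1][None])

# Same interface for all three architectures: image in, one probability per tag out.
probs = session.run(None, {session.get_inputs()[0].name: inputs})[0][0]

# Assumed CSV layout: header row, tag name in the second column.
tag_names = [line.split(",")[1] for line in open(tags_path).read().splitlines()[1:]]
for name, p in sorted(zip(tag_names, probs), key=lambda x: -x[1])[:10]:
    print(f"{name}: {p:.3f}")
```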

Does this answer the question?

So you just know they're different, but which one is better than the other, and which one specializes in which art style, you don't know either?

Yeah I have no idea if SwinV2 works better on some specific images and ConvNext/ViT work better on others.

Going out on a limb, ConvNext might be better suited to dealing with rotated images, while ViT might work better at character recognition, given how transformers can model long-range dependencies and a character might be defined by a few details scattered across the entire image, but I never ran any deep tests in this sense.

Post the results if you happen to run any such test!
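
If anyone does try, here is a rough sketch of what such a test could look like. The repo ids, file names and simplified preprocessing are assumptions, and the single rotated-image check only probes the rotation hypothesis above rather than proving anything:

```python
# Rough sketch of a comparison test: score the same image, upright and rotated,
# with each v3 tagger and compare the probability assigned to a known tag.
# Repo ids, file names and the simplified preprocessing are assumptions.
import numpy as np
import onnxruntime as ort
from PIL import Image
from huggingface_hub import hf_hub_download

REPOS = {  # assumed repo ids for the three v3 taggers
    "vit": "SmilingWolf/wd-vit-tagger-v3",
    "convnext": "SmilingWolf/wd-convnext-tagger-v3",
    "swinv2": "SmilingWolf/wd-swinv2-tagger-v3",
}

def predict(repo_id, image):
    """Return {tag: probability} for one model on one image (simplified preprocessing)."""
    session = ort.InferenceSession(hf_hub_download(repo_id, "model.onnx"),
                                   providers=["CPUExecutionProvider"])
    _, h, w, _ = session.get_inputs()[0].shape
    arr = np.asarray(image.convert("RGB").resize((w, h), Image.BICUBIC), np.float32)
    arr = np.ascontiguousarray(arr[:, :, ::-1][None])  # RGB -> BGR, add batch dim
    probs = session.run(None, {session.get_inputs()[0].name: arr})[0][0]
    names = [line.split(",")[1] for line in
             open(hf_hub_download(repo_id, "selected_tags.csv")).read().splitlines()[1:]]
    return dict(zip(names, probs))

image = Image.open("character.png")      # any test image
rotated = image.rotate(90, expand=True)  # crude probe for rotation robustness
tag = "1girl"                            # example tag expected in the test image

for name, repo in REPOS.items():
    up, rot = predict(repo, image), predict(repo, rotated)
    print(f"{name}: upright={up.get(tag, 0):.3f} rotated={rot.get(tag, 0):.3f}")
```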
