Any benchmark for the latest released model? #61
Comments
We already trained NFNet, RegNet, and ConvNeXt on Danbooru2020; the results were poor and the tagger was not useful, so we don't know what this released model can actually do. You should check out the paper "Transfer Learning for Pose Estimation of Illustrated Characters" (Chen & Zwicker, 2021): its downstream-task experiments show these models are not even as good as commonly pretrained ones. Image-only self-supervised pretraining is also weak (see "Train vision models with vissl + illustrated images"). What we need is a VLM: text-image pretraining (CLIP-style) on a LAION anime subset, with the large amount of anime-unrelated data removed. Preparing and training on that data, rather than on Danbooru20xx, gives a proper pretraining dataset; then we can build an open-vocabulary detector and a good captioner, and move toward anime storytelling.
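To make the LAION-subset idea concrete, here is a minimal sketch of filtering a metadata shard by caption keywords. The parquet filename and the "url"/"caption" column names are assumptions about the metadata layout, not a prescribed pipeline.

```python
# Minimal sketch: keep only rows whose caption looks anime-related.
# The shard filename and column names ("url", "caption") are assumptions.
import pandas as pd

ANIME_KEYWORDS = ("anime", "manga", "illustration", "pixiv", "danbooru")

df = pd.read_parquet("laion_shard_00000.parquet")  # hypothetical metadata shard
mask = df["caption"].str.lower().str.contains("|".join(ANIME_KEYWORDS), na=False)
anime_subset = df.loc[mask, ["url", "caption"]]
anime_subset.to_parquet("laion_anime_subset_00000.parquet")
print(f"kept {len(anime_subset)} / {len(df)} rows")
```

Keyword filtering alone is crude; some additional classifier pass would likely be needed on top to remove the remaining unrelated data.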
I don't have any benchmark test/score of DeepDanbooru for the latest model.
@ControlNet I have the same impression as @koke2c95 that tagging models do not work well on Danbooru data: the tags are too noisy, and the models cannot leverage the relationships between the tags (concepts), which adds further noise.
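One way to at least see the relationships between tags is a simple co-occurrence count over per-image tag lists; the sketch below is only an illustration, with a made-up `image_tags` input rather than real Danbooru metadata.

```python
# Rough sketch: build tag co-occurrence counts from per-image tag lists,
# which could serve as a prior over related tags or to smooth noisy labels.
# `image_tags` is a hypothetical placeholder input.
from collections import Counter
from itertools import combinations

image_tags = [
    ["1girl", "blue_hair", "school_uniform"],
    ["1girl", "blue_hair", "smile"],
    ["2girls", "yuri", "smile"],
]

pair_counts = Counter()
for tags in image_tags:
    for a, b in combinations(sorted(set(tags)), 2):
        pair_counts[(a, b)] += 1

# Tags that frequently co-occur are likely semantically related.
for (a, b), n in pair_counts.most_common(5):
    print(f"{a} + {b}: {n}")
```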
Sorry, this issue got mentioned on Multi-Modal-Comparators by mistake, and it seems the reference can't be removed :(
@koke2c95 Thank you for your reply. I've decided to do simple auto-tagging (multi-label classification) with a more lightweight and more accurate model, so image generation or translation is not in my plan yet. You said the labels are very low quality; I fully understand that, as these tags are community-driven, so the noise cannot be avoided. I'm wondering whether it's possible to employ some self-supervised and weakly supervised learning techniques to improve it. Also, is there any tag-based anime image dataset with accurate labels? BTW, since the height-width ratio of these images varies significantly, I doubt naive resizing will extract the features well; using a sliding window might be a better choice.
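For the sliding-window idea, a rough PyTorch sketch of window-based inference could look like the following; the function, the 224-pixel window, and the `model` interface are illustrative assumptions, not DeepDanbooru's actual pipeline.

```python
# Sketch of sliding-window inference for multi-label tagging: crop overlapping
# square windows from a tall/wide image, score each window, and max-pool the
# per-tag probabilities. `model` is any network mapping a 3x224x224 crop to
# per-tag logits; all names here are illustrative, not DeepDanbooru's API.
import torch

def sliding_window_predict(model, image, win=224, stride=112):
    # image: normalized float tensor of shape (3, H, W); assumes H, W >= win
    _, h, w = image.shape
    crops = []
    for top in range(0, h - win + 1, stride):
        for left in range(0, w - win + 1, stride):
            crops.append(image[:, top:top + win, left:left + win])
    batch = torch.stack(crops)                  # (N, 3, win, win)
    with torch.no_grad():
        probs = torch.sigmoid(model(batch))     # (N, num_tags)
    return probs.max(dim=0).values              # per-tag max over all windows
```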
Thank you for your reply. @Daniel8811 Thank you for your suggestions. I know CLIP is an amazing work, and it's robust on unseen data, but I'm not familiar with it. If the pretrained text encoder is used, I highly doubt that anime-style labels (yuri, genshin_impact, blue_hair) can be predicted well.
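A quick way to check that doubt empirically would be zero-shot scoring with OpenAI's clip package; the prompt template below is just one plausible wording, and the image path is a placeholder.

```python
# Zero-shot check of whether pretrained CLIP can score anime-style tags.
# Uses OpenAI's clip package (pip install git+https://github.com/openai/CLIP.git);
# the prompt template and "sample.jpg" are placeholder choices.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

tags = ["yuri", "genshin_impact", "blue_hair"]
prompts = [f"an anime illustration with the tag {t.replace('_', ' ')}" for t in tags]

image = preprocess(Image.open("sample.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    scores = logits_per_image.softmax(dim=-1).squeeze(0)

for tag, score in zip(tags, scores.tolist()):
    print(f"{tag}: {score:.3f}")
```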
That could be a problem. I agree that, at the time of writing, it's still unclear whether CLIP would really work better on Danbooru data. @koke2c95 So I guess you are doing text2image at the moment. Do you have a thread or write-up for your exploration?
@ControlNet |
@Daniel8811
Yes, it's possible, although manually finding these "obvious" tags for thousands of tags is tricky.
Hi,
I love your project. Could you please provide some benchmarks (accuracy, F1, etc.) for the latest pretrained model in the release?
That would be a great help, because I also want to train some models (EfficientNet, ViT, etc.) myself; by comparing benchmark scores, I could understand the model's performance better.
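For reference, a benchmark like that could be computed with scikit-learn along these lines; the 0.5 threshold and the (num_images x num_tags) array shapes are assumptions, and the arrays below are random placeholders rather than real model outputs.

```python
# Sketch of computing benchmark scores for a multi-label tagger with
# scikit-learn, given binary ground truth and predicted tag probabilities.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.random.randint(0, 2, size=(100, 50))   # placeholder ground truth
y_prob = np.random.rand(100, 50)                   # placeholder model outputs
y_pred = (y_prob >= 0.5).astype(int)               # assumed 0.5 threshold

print("micro F1:  ", f1_score(y_true, y_pred, average="micro"))
print("macro F1:  ", f1_score(y_true, y_pred, average="macro"))
print("precision: ", precision_score(y_true, y_pred, average="micro"))
print("recall:    ", recall_score(y_true, y_pred, average="micro"))
```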