[datasets] Add MJSynth (Synth90K) #827
Codecov Report
@@ Coverage Diff @@
## main #827 +/- ##
==========================================
+ Coverage 94.70% 94.73% +0.02%
==========================================
Files 133 134 +1
Lines 5442 5465 +23
==========================================
+ Hits 5154 5177 +23
Misses 288 288
Thanks Felix!
General comments on this:
- I think we should discuss how we integrate new datasets. There are already quite a lot available through docTR, and we should first assess whether adding another one is worth it
- about long downloads, that's an important point (and a major inconvenience): for reviewers, it means we are unlikely to be able to download the whole thing to test it :/
- the dataset is synthetic, so we should check whether it's worth integrating, considering there is already one that can generate images on-the-fly
- with this many datasets, we need to find a good way to document / structure them by target task: is this an image with an OCR label? text recognition? text detection? etc.
@fg-mindee As I wrote to you, the idea behind this is mainly research: most recognition model benchmarks train on a merge of MJSynth + SynthText and evaluate on SVT, IC03, IC13, ... so if we want to do something like this, we need it 😅 And I totally agree with refactoring the datasets into task-specific splits 👍
Depends on #891, so reverting to draft for the moment 😃
Thanks!
Just need to resolve conflicts!
@charlesmindee ready to merge 👍
Thanks!
This PR adds the ability to load the MJSynth dataset.
It is the first pure recognition dataset, with ~9M English word images.
It's used in many other repositories to train recognition models (in some cases together with crops from SynthText, which we could prepare later) before they are evaluated on IC03, SVT, SVHN, and so on.
I decided not to integrate it with a download link, because the HTTP download takes ~48 hrs and the BitTorrent download is a lot faster.
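Since the archive has to be fetched manually, a loader only needs the path of the extracted copy. As a rough illustration (not the code in this PR), here is a minimal sketch of how word labels could be read from an MJSynth annotation file. It assumes the usual layout where each line holds a relative image path followed by a lexicon index, and where the file name looks like `<id>_<WORD>_<lexicon index>.jpg`. The helper names `parse_mjsynth_line` and `load_annotations` are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def parse_mjsynth_line(line: str) -> Tuple[str, str]:
    """Return (relative image path, word label) for one annotation line.

    Assumption: lines look like "./2697/6/466_MONIKER_49537.jpg 49537".
    """
    rel_path = line.split()[0]
    # The file stem is "<id>_<WORD>_<lexicon index>"; the word is the middle field
    label = Path(rel_path).stem.split("_")[1]
    return rel_path, label


def load_annotations(label_path: str) -> List[Tuple[str, str]]:
    """Parse every non-empty line of a local annotation file."""
    with open(label_path, encoding="utf-8") as f:
        return [parse_mjsynth_line(line) for line in f if line.strip()]
```

For example, `parse_mjsynth_line("./2697/6/466_MONIKER_49537.jpg 49537")` yields the image path together with the label `MONIKER`.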
Any feedback is welcome 🤗
sample: