Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[datasets] Add MJSynth (Synth90K) #827

Merged
merged 33 commits into from
Apr 28, 2022
Merged

Conversation

felixdittrich92
Copy link
Contributor

@felixdittrich92 felixdittrich92 commented Feb 21, 2022

This PR integrates the ability to load the MJSynth dataset MJSynth
This one is the first pure recognition dataset with ~ 9M english word images

It´s used in many other repositories to train a recognition model (in some cases in addition with crops from SynthText (maybe prepare this later)) before this will be evaluated on IC03, SVT, SVHN, and so on.

I have decided to integrate it not with the download link in fact that the http download takes ~48hrs and the BitTorrent download is a lot faster.

Any feedback is welcome 🤗

sample:
download

@codecov
Copy link

codecov bot commented Feb 21, 2022

Codecov Report

Merging #827 (d42214e) into main (56b914c) will increase coverage by 0.02%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##             main     #827      +/-   ##
==========================================
+ Coverage   94.70%   94.73%   +0.02%     
==========================================
  Files         133      134       +1     
  Lines        5442     5465      +23     
==========================================
+ Hits         5154     5177      +23     
  Misses        288      288              
Flag Coverage Δ
unittests 94.73% <95.65%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
doctr/datasets/mjsynth.py 95.45% <95.45%> (ø)
doctr/datasets/__init__.py 100.00% <100.00%> (ø)
doctr/transforms/functional/base.py 97.10% <0.00%> (+1.44%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56b914c...d42214e. Read the comment docs.

@fg-mindee fg-mindee self-assigned this Feb 21, 2022
@fg-mindee fg-mindee added module: datasets Related to doctr.datasets type: new feature New feature ext: docs Related to docs folder labels Feb 21, 2022
Copy link
Contributor

@fg-mindee fg-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Felix!

General comments on this:

  • I think we should discuss integrating new datasets. There are already quite a lot available through docTR, and we should assess whether it's worth adding one first
  • about long downloads, that's an important point (major inconvenience): for reviewers, it means it's unlikely that we can download the whole thing to test it :/
  • the dataset is synthetic, so we should check whether it's worth integrating considering there is already one that can generate images on-the-fly
  • with the number of datasets, we need to find a good way to document / structure datasets in terms of target task: is this an image with OCR label? text recognition ? text detection? etc

doctr/datasets/mjsynth.py Outdated Show resolved Hide resolved
doctr/datasets/mjsynth.py Outdated Show resolved Hide resolved
doctr/datasets/mjsynth.py Outdated Show resolved Hide resolved
tests/conftest.py Show resolved Hide resolved
tests/pytorch/test_datasets_pt.py Outdated Show resolved Hide resolved
tests/tensorflow/test_datasets_tf.py Outdated Show resolved Hide resolved
@felixdittrich92
Copy link
Contributor Author

@fg-mindee
Any chance to integrate this before we (discuss / refactor) the datasets into ocr/detection/recognition splits/tasks ?

How i have wrote you the idea behind this is more for research purpose most model benchmarks (recognition) are created on Mjsynth + SynthText merge for train and svt, ic03, ic13, ... for eval .. so if we want to do something like this we need it 😅

And i agree the refactoring into task splits or better to say task specific totally 👍

@fharper fharper requested review from charlesmindee and removed request for fg-mindee April 8, 2022 18:47
@felixdittrich92 felixdittrich92 marked this pull request as draft April 13, 2022 19:51
@felixdittrich92
Copy link
Contributor Author

Depends on: #891 so for the moment revert to draft 😃

Copy link
Collaborator

@charlesmindee charlesmindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@charlesmindee charlesmindee marked this pull request as ready for review April 27, 2022 15:36
charlesmindee
charlesmindee previously approved these changes Apr 27, 2022
@charlesmindee
Copy link
Collaborator

Just need to resolve conflicts!

@felixdittrich92
Copy link
Contributor Author

@charlesmindee ready to merge 👍

Copy link
Collaborator

@charlesmindee charlesmindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@charlesmindee charlesmindee merged commit f9a1912 into mindee:main Apr 28, 2022
@felixdittrich92 felixdittrich92 deleted the mjsynth branch April 28, 2022 13:29
@frgfm frgfm added the ext: tests Related to tests folder label May 2, 2022
@frgfm frgfm added this to the 0.6.0 milestone May 2, 2022
@frgfm frgfm mentioned this pull request Jun 28, 2022
85 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ext: docs Related to docs folder ext: tests Related to tests folder module: datasets Related to doctr.datasets type: new feature New feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants