[datasets] Add MJSynth (Synth90K) #827

felixdittrich92 · 2022-02-21T11:14:24Z

This PR adds support for loading the MJSynth dataset.
It is the first pure recognition dataset, with ~9M English word images.

It's used in many other repositories to train recognition models (in some cases together with crops from SynthText, which we could prepare later) before evaluating them on IC03, SVT, SVHN, and so on.

I decided not to integrate it via a download link, because the HTTP download takes ~48 hours and the BitTorrent download is much faster.
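Since the archive has to be fetched manually, a loader mostly needs to turn the annotation file into (image path, label) pairs. Below is a minimal sketch of that step, not the code from this PR. It assumes the usual MJSynth layout, where each annotation line looks like `./2697/6/466_MONIKER_49537.jpg 49537` and the word is encoded in the file name between the underscores. The function name `parse_annotation_line` is hypothetical.

```python
import posixpath
from typing import Tuple


def parse_annotation_line(line: str) -> Tuple[str, str]:
    """Return (relative image path, word label) for one annotation line.

    Assumed line format: "./<dir>/<dir>/<idx>_<WORD>_<lexicon_id>.jpg <lexicon_id>"
    """
    # The first whitespace-separated token is the relative image path
    rel_path = line.strip().split()[0]
    file_name = posixpath.basename(rel_path)
    parts = file_name.split("_")
    if len(parts) < 3:
        raise ValueError(f"unexpected file name: {file_name!r}")
    # Everything between the leading index and the trailing lexicon id is the word
    label = "_".join(parts[1:-1])
    return posixpath.normpath(rel_path), label


path, label = parse_annotation_line("./2697/6/466_MONIKER_49537.jpg 49537")
print(path, label)
```

Taking the label from the file name rather than from the trailing number avoids a lookup in the separate lexicon file.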

Any feedback is welcome 🤗

codecov · 2022-02-21T11:23:32Z

Codecov Report

Merging #827 (d42214e) into main (56b914c) will increase coverage by 0.02%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##             main     #827      +/-   ##
==========================================
+ Coverage   94.70%   94.73%   +0.02%     
==========================================
  Files         133      134       +1     
  Lines        5442     5465      +23     
==========================================
+ Hits         5154     5177      +23     
  Misses        288      288

Flag	Coverage Δ
unittests	`94.73% <95.65%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
doctr/datasets/mjsynth.py	`95.45% <95.45%> (ø)`
doctr/datasets/__init__.py	`100.00% <100.00%> (ø)`
doctr/transforms/functional/base.py	`97.10% <0.00%> (+1.44%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 56b914c...d42214e.

fg-mindee

Thanks Felix!

General comments on this:

- I think we should discuss integrating new datasets. There are already quite a lot available through docTR, and we should first assess whether adding another one is worth it.
- About long downloads, that's an important point (and a major inconvenience): for reviewers, it means it's unlikely that we can download the whole thing to test it :/
- The dataset is synthetic, so we should check whether it's worth integrating, considering there is already one that can generate images on-the-fly.
- With the number of datasets, we need to find a good way to document / structure datasets in terms of target task: is this an image with an OCR label? Text recognition? Text detection? Etc.
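One way to picture the task-based structuring mentioned in the last point is a small registry that tags each dataset with its target task(s). This is only an illustrative sketch: `Task`, `DatasetInfo`, `REGISTRY`, and `datasets_for` are hypothetical names and not part of docTR, and only the MJSynth entry is taken from this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List


class Task(Enum):
    OCR = "ocr"                  # full-page image with OCR labels
    DETECTION = "detection"      # text localization only
    RECOGNITION = "recognition"  # cropped word image -> string


@dataclass(frozen=True)
class DatasetInfo:
    name: str
    tasks: FrozenSet[Task]
    synthetic: bool


# Hypothetical registry; only the MJSynth entry comes from this discussion
REGISTRY: Dict[str, DatasetInfo] = {
    "MJSynth": DatasetInfo("MJSynth", frozenset({Task.RECOGNITION}), synthetic=True),
}


def datasets_for(task: Task) -> List[str]:
    """List the registered dataset names that support the given task."""
    return sorted(name for name, info in REGISTRY.items() if task in info.tasks)


print(datasets_for(Task.RECOGNITION))
```

The same metadata could drive the documentation, so each dataset page states its task up front.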

doctr/datasets/mjsynth.py

tests/conftest.py

tests/pytorch/test_datasets_pt.py

tests/tensorflow/test_datasets_tf.py

felixdittrich92 · 2022-03-11T23:27:23Z

@fg-mindee
Any chance to integrate this before we discuss / refactor the datasets into OCR / detection / recognition task splits?

As I wrote to you, the idea behind this is mainly for research purposes: most recognition model benchmarks are trained on a merge of MJSynth + SynthText and evaluated on SVT, IC03, IC13, ... so if we want to do something like this, we need it 😅

And I totally agree with the refactoring into task splits, or better said, task-specific datasets 👍

felixdittrich92 · 2022-04-13T19:52:24Z

Depends on: #891, so reverting to draft for the moment 😃

charlesmindee

Thanks!

charlesmindee · 2022-04-27T15:37:20Z

Just need to resolve conflicts!

felixdittrich92 · 2022-04-27T20:03:58Z

@charlesmindee ready to merge 👍

charlesmindee

Thanks!

felixdittrich92 added 6 commits January 11, 2022 08:34

backup

81c313e

Merge branch 'mindee:main' into main

50574b5

Merge branch 'mindee:main' into main

5a6ed54

Merge branch 'mindee:main' into main

b9958a7

Merge branch 'mindee:main' into main

14c4651

Merge branch 'mindee:main' into main

779731f

fg-mindee self-assigned this Feb 21, 2022

fg-mindee added the module: datasets, type: new feature, and ext: docs labels Feb 21, 2022

fg-mindee reviewed Feb 21, 2022

felixdittrich92 requested a review from fg-mindee February 21, 2022 20:24

Merge branch 'mindee:main' into main

ce2cdda

felixdittrich92 force-pushed the mjsynth branch from 7cb7d81 to 37209b0 Compare February 22, 2022 11:24

Merge branch 'mindee:main' into main

d13dc43

felixdittrich92 force-pushed the mjsynth branch from 37209b0 to 02a26d3 Compare February 23, 2022 12:37

felixdittrich92 added 6 commits February 24, 2022 08:35

Merge branch 'mindee:main' into main

9a07d73

Merge branch 'mindee:main' into main

a002a70

Merge branch 'mindee:main' into main

6ad096e

Merge branch 'mindee:main' into main

1e77fd4

Merge branch 'mindee:main' into main

2be762c

Merge branch 'mindee:main' into main

e2f2055

felixdittrich92 force-pushed the mjsynth branch from 02a26d3 to 063b9a6 Compare March 11, 2022 22:11

felixdittrich92 added 5 commits March 16, 2022 09:26

Merge branch 'mindee:main' into main

bdc4e67

Merge branch 'mindee:main' into main

b525021

Merge branch 'mindee:main' into main

417a27b

Merge branch 'mindee:main' into main

9b3f5a1

Merge branch 'mindee:main' into main

93074a8

felixdittrich92 added 2 commits April 5, 2022 21:23

Merge branch 'mindee:main' into main

7ac6ee2

Merge branch 'mindee:main' into main

1c79f32

fharper requested review from charlesmindee and removed request for fg-mindee April 8, 2022 18:47

Merge branch 'mindee:main' into main

45e43ac

felixdittrich92 marked this pull request as draft April 13, 2022 19:51

Merge branch 'mindee:main' into main

53ba4b9

charlesmindee reviewed Apr 27, 2022

charlesmindee marked this pull request as ready for review April 27, 2022 15:36

charlesmindee previously approved these changes Apr 27, 2022

felixdittrich92 added 5 commits April 27, 2022 21:40

Merge branch 'mindee:main' into main

96b597c

add mjsynth loader

5eebf06

apply changes

c1b714e

rename

d7bbe81

update

4f45239

felixdittrich92 dismissed charlesmindee’s stale review via 4f45239 April 27, 2022 19:55

felixdittrich92 force-pushed the mjsynth branch from 7f4cbce to 4f45239 Compare April 27, 2022 19:55

felixdittrich92 added 2 commits April 27, 2022 21:59

update

7877ac5

fix tests

d42214e

felixdittrich92 requested review from fg-mindee and charlesmindee April 27, 2022 20:03

charlesmindee approved these changes Apr 28, 2022

charlesmindee merged commit f9a1912 into mindee:main Apr 28, 2022

felixdittrich92 deleted the mjsynth branch April 28, 2022 13:29

frgfm added the ext: tests label May 2, 2022

frgfm added this to the 0.6.0 milestone May 2, 2022

frgfm mentioned this pull request Jun 28, 2022

Release tracker - v0.6.0 #791

Closed

85 tasks
