[datasets] Extend the range of public datasets supported in docTR

Currently, we support `FUNSD`, `CORD` and `SROIE`, but we should extend the range of supported datasets. Among others, we could cover handwritten text and in-the-wild (scene text) situations.

Here is a list of datasets you can usually find in OCR-related benchmarks:
- [x] IIIT-5k (https://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset) #589
- [x] SVT (http://vision.ucsd.edu/~kai/svt/) #597 #620
- [x] IC03 (http://www.iapr-tc11.org/mediawiki/index.php?title=ICDAR_2003_Robust_Reading_Competitions) #653
- [x] IC13 (http://dagdata.cvc.uab.es/icdar2013competition/?ch=2&com=downloads) #662
- [x] SVHN (http://ufldl.stanford.edu/housenumbers/) #634
- [x] SynthText (https://github.com/ankush-me/SynthText) #624
- [x] IMGUR5K (https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) #785 

Of course, the list goes on.

[datasets] Extend the range of public datasets supported in docTR #587
