
Torch Data Augmentation #777

Merged: 1 commit, merged on Jul 26, 2016

Conversation

@TimZaman (Contributor) commented May 24, 2016

Data augmentation needs little introduction, I reckon. It counters overfitting and makes your model generalize better, yielding better validation accuracies; alternatively, it allows you to use smaller datasets with similar performance.

In the zoo that is the internet, I see many implementations of different augmentations, of which few are proper and nicely portable. Apart from DIGITS providing a great UI, ease of use, and a turn-key deep learning solution, I strongly feel we can expand on the functional side as well to make this a deep learning killer app.

For Torch, I have made an implementation that performs the augmentation during the Lua preprocessing step, wired from frontend to backend so DIGITS can drive it. In #330 there was already an attempt at augmentation, which happened on the dataset-creation side; something I am strongly against. Resizing and cropping I would consider a transformation, whereas modifying the data once it is in its container is what I consider an augmentation. I therefore think it's fine to resize during dataset loading (and squash/fill/etc.), but I would leave it at that.

Anyway, I set up a more dynamic structure to pass these options around on the Torch side; instead of adding a dozen arguments to each function, I just pass a single table.
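
To make that structure concrete, here is a hypothetical sketch in Python (the PR itself does this with a Lua table); every key name and value below is illustrative, not the actual fields used in the PR:

```python
# Hypothetical augmentation-options structure, passed around as one object
# instead of a dozen separate function arguments (all names are made up).
augmentation_opts = {
    "flip": "fliplr",               # none | fliplr | flipud | fliplrud
    "quad_rotation": "rotall",      # none | rot90 | rot180 | rotall
    "rotation_deg": 5,              # max arbitrary rotation, in degrees
    "scale_std": 0.05,              # stddev of the random rescale factor
    "hsv_std": (0.02, 0.04, 0.06),  # per-channel HSV jitter stddevs
    "noise_std": 0.0,               # AWGN stddev; 0 disables it
}

def wants_hflip(opts):
    """Example consumer reading one field from the shared options table."""
    return opts.get("flip") in ("fliplr", "fliplrud")
```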

Implements the following (screenshot):
[screenshot of the data augmentation UI options]

I have iterated through many augmentation types, but these were the most useful. Almost done; now running elaborate tests.

Progress

The code is already functional, but see the progress below.
See the code and shoot!

Features

  • Make UI data transforms only visible for the Torch framework (invisible for Caffe)
  • Implement UI option for normalization (scales [0, 255] to [0, 1])
  • Data Augmentation UI
  • Flips (mirrors)
  • Quadrilateral rotations
  • Arbitrary rotations
  • Arbitrary scales
  • Augmenting in HSV space
  • Augmenting with noise (Thoughts?)
  • [Travis] Tests
  • Use Data Augmentation Template: data_augmentation.html

Testing

  • No augmentation
  • Flips (mirrors)
  • Quadrilateral rotations
  • Arbitrary rotations
  • Arbitrary scales
  • Arbitrary rotations & arbitrary scales
  • Augmenting in HSV space
  • Augmenting with noise
  • All Augmentations & benchmark speed; identify bottlenecks
  • Verify that models report the expected trade-off of slower learning / less overfitting: more generalization.

@TimZaman (Contributor, Author) commented May 24, 2016

Is this also the place to discuss the following:

  • The default 'Mean Subtraction' option is 'Image'. I have never really seen any model of mine perform better with it than with 'None', only worse. I would suggest defaulting to 'None' (moreover, it's hefty on the CPU).
  • test.lua expects square croplen images through an assert, whereas main.lua magically takes the minimum of (croplen, x-dim, y-dim), and might ignore the crop parameter altogether without the operator knowing about it.
  • The image input (during and after Torch preprocessing) is always in the range [0, 255], I guess due to backend (LMDB/HDF5) constraints. Shouldn't we scale this to [0, 1] by default? I recall the standard LeNet network has :mul(1/255) as its first layer; what's the general convention here? We could also make normalization optional (see the sketch right below).
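
A minimal sketch of the optional [0, 255] to [0, 1] scaling being discussed (NumPy, purely illustrative; the real change would live in the Torch preprocessing path):

```python
import numpy as np

def normalize(img_uint8, enabled=True):
    """Optionally rescale an 8-bit image from [0, 255] to [0, 1].

    Equivalent in spirit to LeNet's :mul(1/255) first layer, but applied
    once during preprocessing instead of inside the network.
    """
    img = img_uint8.astype(np.float32)
    return img / 255.0 if enabled else img
```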

@lukeyeager (Member):

The default 'Mean Subtraction' is one 'Image'. I have never really seen any model of mine perform better than 'None', only worse. I would suggest a default to None (moreover, it's hefty on the CPU).

Really? Interesting. I'm pretty sure LeNet won't work, at least.

And I'm also pretty sure that AlexNet and GoogLeNet both had some form of mean subtraction in their original implementations.

@TimZaman (Contributor, Author):

Really? Interesting. I'm pretty sure LeNet won't work, at least.
And I'm also pretty sure that AlexNet and GoogLeNet both had some form of mean subtraction in their original implementations.

What I have come to realize is that deep learning is so complex and has so many bells and whistles that in many ways it's not an exact science anymore, but scientific guesses and gut feelings. If you are using mean subtraction all the time, you will probably keep using it forever, without really knowing why. The same thing goes for max pooling, for example: in most cases you might as well increase the stride without any loss. It's something someone thought of, it seemed to work well (or it didn't), and then nobody bothered much about it afterwards.

The same goes for preprocessing into a different colorspace: RGB to HSV, RGB with local normalization, RGB with super-fancy ZCA whitening. In essence, you are never actually adding any information (obviously); you can only take information out. What you can do is emphasize certain parts (as with enhancing edge values through local contrast or whitening), but again, deep networks can figure that out without our help. In my experience, plain RGB does fine.

Let's look at what mean subtraction does for MNIST:
[image: MNIST mean subtraction example]

My gut feeling says the mean-subtracted image probably only makes things worse (what does the mean of all these images mean, anyway?).

Let's investigate MNIST with the default DIGITS settings:

| Mean subtraction | Max accuracy (%) | Final accuracy (%) |
| ---------------- | ---------------- | ------------------ |
| Image            | 97.92            | 97.94              |
| Pixel            | 98.19            | 98.15              |
| None             | 98.17            | 98.16              |

The above table is in agreement with what I said earlier: why would image subtraction help? We originally do this just to put the mean at 0, because we learned from statistics that that's proper. But a deep neural network can handle almost any range just fine, as long as the hyperparameters are balanced to the input.
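
To spell out the three options being compared, here is a hypothetical NumPy sketch (in DIGITS the mean is of course computed over the training set at dataset-creation time; this only illustrates the difference between 'Image', 'Pixel', and 'None'):

```python
import numpy as np

def subtract_mean(images, mode="none"):
    """Apply one of the mean-subtraction modes to a batch of images.

    images: float array of shape (N, H, W, C).
    'image' subtracts the per-position mean image, 'pixel' subtracts a
    single per-channel mean value, and 'none' leaves the data untouched.
    """
    if mode == "image":
        return images - images.mean(axis=0)          # mean image, (H, W, C)
    if mode == "pixel":
        return images - images.mean(axis=(0, 1, 2))  # mean pixel, (C,)
    return images
```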

Another example, then: the mean image of a Kaggle challenge I found:
[mean image of a Kaggle challenge dataset]
Subtracting that mean image ^ will yield really weird results.

Or the mean image of ImageNet:
[mean image of ImageNet]
This gray blob won't help your ImageNet accuracy one bit ^.

In their original implementations I bet many used mean subtraction, but that does not mean it yields the best results, which I think is what matters more.

See if you can reproduce the results from the above table. Why did you say it won't work for LeNet?

@gheinrich (Contributor):

This can be debated at length, but a reference paper on the subject is Efficient BackProp. Normalizing inputs and initializing weights carefully aims at sticking to the range where transfer functions are nicely non-linear and have non-zero gradients. You can always get around this after some learning (the network will learn how much bias to apply to the receptive fields), but sometimes it is almost impossible to learn efficiently. If the input to e.g. a sigmoid is very large, then the gradient is almost zero and it is very difficult to learn.
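
A quick numerical illustration of that last point (not from the paper or the PR, just a sanity check): the sigmoid gradient collapses once inputs get large, which is exactly what input normalization tries to avoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.5))    # ~0.235: plenty of gradient to learn from
print(sigmoid_grad(100.0))  # ~3.7e-44: effectively zero, learning stalls
```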

In the example from the regression tutorial I've seen cases where the network diverges (loss->NaN) without mean subtraction. This has never happened with mean subtraction. The learning curve is smoother with mean image subtraction:
[learning curve with mean image subtraction]
Compare with no mean subtraction:
[learning curve without mean subtraction]

The common practice is to normalize inputs so I guess that's what we need to do by default in DIGITS. On that subject you're making a good point about the pixel range:

The input of the image (during and after torch preprocessing) is always in the range [0 255]

I agree it would make more sense to be in the [0, 1] range. We could very well do that in Torch, but the thing is, this isn't what Caffe is doing. Since DIGITS originally only supported Caffe, I opted for the [0, 255] range to keep the user experience as close as possible. I fear this will make us suffer when we want to support 16-bit images!

Thanks a lot for the PR this is great!

@lukeyeager (Member):

See if you can reproduce the results from the above table. Why did you say it won't work for LeNet?

Oh you're right! I was thinking of something else, related to something else you brought up:

The input of the image (during and after torch preprocessing) is always in the range [0 255], I guess due to the backend (lmdb/hdf5) constraints. Shouldn't we by default scale this to [0 1]? I recollect the LeNet standard network have the :mul(1/255) in there as first layer; what's the general AI network convention here? We can also make normalization optional?

AlexNet and GoogLeNet both work for [0-1] and [0-255] data, but LeNet only works on [0-1] data.

So nevermind about the mean subtraction - my bad!

@gheinrich (Contributor):

Make UI data transforms only visible for the Torch framework (invisible for Caffe)

For this you could have a look at the data shuffle capability and how it's used in the templates to selectively enable the corresponding form field.

@@ -356,16 +356,104 @@ <h4 style="display:inline-block;">Python Layers</h4>
<div class="col-sm-4">
<div class="well">
<h4>Data Transformations</h4>
<div class="form-group{{mark_errors([form.use_mean])}}">
@gheinrich (Contributor) commented May 24, 2016

I feel we haven't done enough of this but you could perhaps move this section to a data_augmentation.html template and include it here and in digits/templates/models/images/generic/new.html

@TimZaman (Contributor, Author):

Come to think of it, indeed I think LeNet was meant for bitonal images, so probably no default mean subtraction at all. Moreover, you don't need mean subtraction over the dataset to get a range that complies with the paper you are referring to, so we do not need to discuss that at length; you are right. But as just shown with MNIST, mean subtraction by the pixel-for-pixel mean of all images in the dataset will never yield better results, especially when the dataset does not contain hundreds of thousands of images. That is why I proposed not using 'Image' mean subtraction by default.


@TimZaman (Contributor, Author) commented May 25, 2016

For this you could have a look at the data shuffle capability and how it's used in the templates to selectively enable the corresponding form field.

Thanks, that helped. I made it fairly dynamic now, and made a neat template. I took the liberty of putting the template in digits/templates/models/ because it can be used by both the generic and the classification class; hope that's all right. Actually, I'd rather put it in digits/templates/models/images/, but that depends on the following: in an effort to reduce redundancy we could merge the identical files
generic/large_graph.html with classification/large_graph.html
generic/custom_network_explanation.html with classification/custom_network_explanation.html
and maybe put those in a more general folder.

@pansk commented May 25, 2016

I like this feature; it's basically more than what I've done in my MATLAB preprocessing step (except for the multiscale extraction: I also extract regions from the same image downsampled 2x, 4x, etc.).

About mean subtraction, in my experience with autoencoders it did help (at least in the first tests I took), so I ended up using it by default. I'll run a test on one of the latest models right now, and see what happens.

@TimZaman (Contributor, Author) commented May 26, 2016

I like this feature

Perfect, would you like to help test when it's ready for review? Do you have any other augmentation steps that work well? For example, I have not included 'blurring' because it seemed to be relatively ineffective. How is multiscale extraction working for you?

About mean subtraction, in my experience with autoencoders it did help (at least in the first tests I took), so I ended up using it by default.

Sure it helps for the reasons accurately described above. But using 'Image' subtraction as opposed to 'Pixel' subtraction will probably not make an improvement I think.

@gheinrich (Contributor):

using 'Image' subtraction as opposed to 'Pixel' subtraction will probably not make an improvement I think

I agree: image subtraction probably only helps for MNIST, where the digits are nicely centred in the image, but for realistic datasets pixel subtraction might make more sense. Besides, image subtraction is painful to work with for networks that accept various input sizes, like FCNs.

@pansk commented May 26, 2016

@TimZaman sure, I'd gladly test!
Another augmentation I would like is a filtering pass (lowpass/highpass), but I think it's very application-specific, and it's probably not worth generalising this step.

What about adding noise (at different scales)? E.g. augment a set by adding images with 0%, 1%, 2%, 3%, 5%, 10% noise (the list of noise percentages specified by the user).

Sure it helps for the reasons accurately described above. But using 'Image' subtraction as opposed to 'Pixel' subtraction will probably not make an improvement I think.

I ran a full training on a previous network, removing the mean subtraction: the resulting quality is unchanged. Thank you for pointing this out! 👍

@TimZaman (Contributor, Author):

High-pass and low-pass filtering... I need to think about that one, and how it would help generalization. How do you use it now: DFT, or something simpler? Adding noise or blurring (blurring would already be low-pass filtering, I guess) is straightforward and we can put it in, but I have often seen people report that it's almost the same as adding dropout: you're not changing the image significantly enough; compare that to a horizontal flip. A horizontal flip for something like natural images almost doubles your dataset, because the flipped image looks quite different, whereas perturbing pixels with noise is like dropping out some pixels. Having said that, I might as well try. What are your experiences there?


@TimZaman (Contributor, Author):

Stumbled upon some bugs in torch/image while testing scale and rotation. torch/image#169
Now running CIFAR10 benchmarks for many augmentation types.

@pansk commented May 27, 2016

I have tried both adding noise to the original images (a huge amount of training/test data is produced) and using dropout in place of noise.
I think adding a known amount of noise is more straightforward, since you can obtain the performance values for different noise scales directly from your test results.

@TimZaman (Contributor, Author) commented May 27, 2016

I think adding a known amount of noise is more straightforward, since you can obtain the performance values for different noise scales directly from your test results.

Okay, I'll see if I can add that. What kind of noise do you suggest, and what kind of underlying distribution? A matrix of normally distributed samples, then multiplying? Each channel separate, or the same for each? We're going to need to make a ton of assumptions :).

('rot90', '0, 90 or 270 degrees'),
('rot180', '0 or 180 degrees'),
('rotall', '0, 90, 180 or 270 degrees.'),
],
Contributor:

Are rot180 and rotall useful? Rotation by 180 degrees is the same as vertical flipping.

@TimZaman (Contributor, Author) commented May 27, 2016

Almost. A vertical flip + a horizontal flip = a 180-degree rotation.
There is an interesting case to be made when you have fliplrud on, which can flip by 180 degrees (chance 1 in 4). If you have also turned on a rot* rotation that includes the 180-degree rotation, then statistically your chance of getting a 180-degree rotation is slightly higher than that of any other flip or rotation, because of their redundancy.
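
A quick NumPy check of that equivalence (illustrative only, not part of the PR):

```python
import numpy as np

img = np.random.rand(32, 32, 3)            # any H x W x C image
vflip = img[::-1, :, :]                    # vertical flip (up-down)
rot180 = np.rot90(img, k=2, axes=(0, 1))   # rotation by 180 degrees

# A vertical flip alone is not a 180-degree rotation...
assert not np.array_equal(vflip, rot180)
# ...but a vertical flip followed by a horizontal flip is.
assert np.array_equal(vflip[:, ::-1, :], rot180)
```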

@TimZaman (Contributor, Author) commented May 27, 2016

Initial results. Trained on CIFAR10 with a great VGG network, with the overfitting we love to see (https://github.com/szagoruyko/cifar.torch/blob/master/models/vgg_bn_drop.lua).

The results reveal the augmentation is working really nicely. Training speed per epoch does not seem to be impacted.

Scale

[training curves: random scale augmentation]

Rotation

[training curves: arbitrary rotation augmentation]

Flipping

[training curves: horizontal flip augmentation]

HSV

HSV was a lot of fun because of the Wikipedia-copied implementation that's in Torch (HSV isn't that well standardized, I guess), but at least it models something that resembles HSV.
[training curves: HSV augmentation]
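
For reference, a rough NumPy sketch of what HSV-space jitter amounts to (the PR's actual implementation is in Lua/Torch; the function name, parameters, and defaults here are made up):

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def augment_hsv(img, std_h=0.02, std_s=0.04, std_v=0.06, rng=None):
    """Jitter an RGB image (H x W x 3, floats in [0, 1]) in HSV space.

    One random offset per channel and per image, drawn from a zero-mean
    normal with the given stddev; hue wraps around, S and V are clipped.
    """
    rng = rng or np.random.default_rng()
    hsv = rgb_to_hsv(img)
    hsv[..., 0] = (hsv[..., 0] + rng.normal(0, std_h)) % 1.0
    hsv[..., 1] = np.clip(hsv[..., 1] + rng.normal(0, std_s), 0, 1)
    hsv[..., 2] = np.clip(hsv[..., 2] + rng.normal(0, std_v), 0, 1)
    return hsv_to_rgb(hsv)
```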

I also tested for speed, and the overhead seems negligible; but a CPU that's up to the task is required (i.e. we need to be able to use 400% CPU for the four loader threads at times). This might be more for bigger images, although bigger images often require/use bigger (slower) networks.

@TimZaman (Contributor, Author) commented May 30, 2016

All Augmentations

HFlip, rotation 5, scale stddev 0.05, HSV (0.01, 0.02, 0.04) yields a full 3 percent validation accuracy increase and a validation loss decrease from 0.59 to 0.35.

[training curves: all augmentations combined]

@gheinrich (Contributor):

That is totally awesome! That is a truly great feature! Can't wait to have that merged.

Do you think you can add tests for this (not necessarily to test that augmentation reduces overfitting, but at least to exercise the new code in the automatic tests)? Do you need help there?

@TimZaman (Contributor, Author):

Do you think you can add tests for this (...). Do you need help there?

Yes, I need some help indeed; a few pointers would be great. Which tests do you suggest, and where do you suggest I put them? I have not looked into how this is done in this project at all.

@pansk commented May 30, 2016

I'd start with AWGN (Additive White Gaussian Noise) because it's easier to generate (just adding a 0-mean normal-distributed random variable will do), and you just need one parameter (SNR or standard deviation) to describe it (if you assume the signal's power at 0dB).
It is also a good approximation for thermal noise.
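
A minimal NumPy sketch of that suggestion (illustrative; the stddev value is arbitrary, and images are assumed to be scaled to [0, 1]):

```python
import numpy as np

def add_awgn(img, std=0.05, rng=None):
    """Additive white Gaussian noise for an image with values in [0, 1].

    One zero-mean normal sample per pixel and per channel, with the given
    standard deviation; the result is clipped back to the valid range.
    """
    rng = rng or np.random.default_rng()
    return np.clip(img + rng.normal(0.0, std, size=img.shape), 0.0, 1.0)
```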

@TimZaman (Contributor, Author) commented Jun 6, 2016

@pansk
Cropping happens at the very end.
I cannot reproduce your HSV error; can you tell me exactly what you did, with which settings, etc.?

Off-topic: I am in favor of using rescaling over cropping. A rescale augmentation is more methodologically correct in my opinion, mostly because with cropping you generally have the problem of 'how do I validate my source images?', which in DIGITS-Torch is answered with a center crop.
If you stop using cropping and use rescaling instead, you can still (when zoomed in or out) wiggle your image around the canvas, just as with cropping (think about it). Your mean 'rescale' will be 1:1, just like your validation-pass image, which is used entirely and at 1:1.
But, for historic reasons, everyone just uses cropping.
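
As an illustration of 'rescale and wiggle instead of crop' (a hypothetical NumPy sketch, not DIGITS code; nearest-neighbour resizing keeps it dependency-free):

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an H x W (x C) array."""
    in_h, in_w = img.shape[:2]
    rows = (np.arange(out_h) * in_h // out_h).astype(int)
    cols = (np.arange(out_w) * in_w // out_w).astype(int)
    return img[rows][:, cols]

def rescale_and_place(img, canvas=64, scale_std=0.05, rng=None):
    """Resize by a random factor around 1.0 and place at a random offset.

    On average the augmentation is the identity (factor 1.0, centred on
    the canvas), matching how the unaugmented validation image is used.
    """
    rng = rng or np.random.default_rng()
    factor = 1.0 + rng.normal(0.0, scale_std)
    new = max(1, int(round(canvas * factor)))
    resized = resize_nearest(img, new, new)
    out = np.zeros((canvas, canvas) + img.shape[2:], dtype=img.dtype)
    max_off = canvas - new
    top = int(rng.integers(min(0, max_off), max(0, max_off) + 1))
    left = int(rng.integers(min(0, max_off), max(0, max_off) + 1))
    # Negative offsets mean the over-scaled image is randomly cropped instead.
    src_t, dst_t = max(0, -top), max(0, top)
    src_l, dst_l = max(0, -left), max(0, left)
    h = min(new - src_t, canvas - dst_t)
    w = min(new - src_l, canvas - dst_l)
    out[dst_t:dst_t + h, dst_l:dst_l + w] = resized[src_t:src_t + h, src_l:src_l + w]
    return out
```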

@@ -0,0 +1,67 @@
{# Copyright (c) 2014-2016, NVIDIA CORPORATION. All rights reserved. #}
Contributor:

Minor comment: you can make this 2016 only, since this is a new file.

@pansk commented Jun 8, 2016

I'm sorry for the delay; I'm verifying, because it doesn't seem to converge well and I want to be sure it's not something related to my specific model.

@TimZaman (Contributor, Author) commented Jun 8, 2016

Strange. What kind of model are you using, and which dataset? I verified this to work well on a few different datasets, although the HSV augmentation could be slightly improved. Especially the horizontal flipping is very straightforward and should help; rotation by a few degrees helps on most datasets, and scaling too.


@pansk commented Jun 9, 2016

I'm pretty sure it's my code; that's why I'm going to try with a simpler autoencoder model, but I have to build a specific dataset for that (MNIST is grayscale only).

@pansk commented Jun 9, 2016

Just for reference, I was going to post this comment a few days ago:

I'm training an autoencoder on some custom datasets of uniform images (like wood, stone walls, grass, and so on), so I don't really need to validate my input, but border effects might be annoying for me. On the other hand, I'd like to provide just a bunch of pictures (possibly of different sizes) and let DIGITS generate its own set of images by extracting random regions, rotating, scaling, adding noise, and so on. That's why I'd like to extract more than one region per source image.

My parameters were:

[screenshot of the augmentation settings used]

For the general case, I agree about cropping vs. scaling (actually, scaling gives you more robustness).

Anyway, are you sure DIGITS-Torch just uses center cropping? If so, the balloon help for the crop parameter is a bit misleading: "If specified, during a training a random square crop will be taken from the input image before using as input for the network."

@gheinrich (Contributor):

are you sure digits-torch just uses center-cropping?

@TimZaman probably meant to say that we are doing centre cropping during validation. During training, we are doing random cropping.

@TimZaman (Contributor, Author) commented Jun 9, 2016

are you sure digits-torch just uses center-cropping?

@TimZaman probably meant to say that we are doing centre cropping during validation. During training, we are doing random cropping.

Correct, I said, verbatim:

(...) with cropping you generally have a problem of 'how do I validate my source images', which in Digits-torch is a center-crop.

@pansk commented Jun 9, 2016

Sorry Tim, I didn't notice you were referring to validation only.
Do any of you know the reason behind this choice?

@TimZaman (Contributor, Author) commented Jun 9, 2016

It's just one valid choice among many. If you want to be fancier, you can take a few crops and average their predictions (or take the max before normalizing). There's really no 'correct' way of validating when you are cropping, since you have to cut parts of your source image off during validation, because your model's input is smaller than your actual images. But in practice, performance-wise, it doesn't matter much.
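
For example, averaging over a few crops at validation time could look roughly like this (hypothetical sketch; `model_fn` stands in for whatever maps a single crop to class probabilities):

```python
import numpy as np

def predict_multicrop(model_fn, img, crop=224, n_crops=5, rng=None):
    """Average predictions over the centre crop plus a few random crops."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    tops = [(h - crop) // 2] + list(rng.integers(0, h - crop + 1, n_crops - 1))
    lefts = [(w - crop) // 2] + list(rng.integers(0, w - crop + 1, n_crops - 1))
    preds = [model_fn(img[t:t + crop, l:l + crop]) for t, l in zip(tops, lefts)]
    return np.mean(preds, axis=0)
```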

@TimZaman (Contributor, Author):

With the latest revision, and if Travis agrees with me, this PR is done, I think.

@gheinrich (Contributor):

This all looks good to me, thanks for the awesome PR! Can you squash your commits (possibly rebase too)?

Implemented in python and ui

Implemented dynamic UI toggle and moved augmentation html to template

Fixes uncovered during testing

Added AWGN augmentation, reduced complexity, typos, syntax fixes

Implemented a test to check at least all augmentations will run

Added test initialization params

Trivial language fixes and a few bugs

@TimZaman (Contributor, Author):

Squashed & rebased.

@gheinrich (Contributor):

Splendid!

@philipperemy:

Any updates since then?

@lukeyeager (Member):

This looks good to me!

@gheinrich please merge unless you have more concerns.

@TimZaman (Contributor, Author):

The 'data hook' idea has also grown on me. I think it's a great (and pretty straightforward) feature. But I do like the UI this PR brings, for obvious reasons. It's too bad Caffe doesn't have great augmentation layers in its master, although there are some pro forks.
Also, there is a proper PR in Caffe for a confusion matrix layer. If that gets added, Caffe could show it during training; as with Torch, it can be easily captured with a regex. Mkay, I digress.


@gheinrich gheinrich merged commit 9ae9fa0 into NVIDIA:master Jul 26, 2016