Training errors - nans #19

Open
Abelarm opened this issue Apr 11, 2024 · 14 comments
Labels: bug Something isn't working

Comments

@Abelarm

Abelarm commented Apr 11, 2024

Hi,

First of all, thanks for the great work!
Everything works almost plug-and-play for inference.

I am having trouble while fine-tuning; I usually get one of two errors:

logits: torch.Size([16, 512, 128])) to satisfy the constraint GreaterThan(lower_bound=0.0)

or

Categorical(logits: torch.Size([16, 512, 128, 4])) to satisfy the constraint IndependentConstraint(Real(), 1), in this case because there are some NaNs in the tensor.

Do you have any idea how to solve this, or is it because there is something wrong with my data?
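
As a quick sanity check, something like the sketch below can rule out bad raw values before they ever reach the model; the my_dataset mapping and check_series helper are hypothetical, not part of the repo:

import numpy as np

def check_series(name, values):
    # report any NaN/inf hiding in a single raw series
    values = np.asarray(values, dtype=np.float64)
    bad = ~np.isfinite(values)
    if bad.any():
        print(f"{name}: {int(bad.sum())} non-finite values, first indices {np.flatnonzero(bad)[:10]}")

# hypothetical usage over a dict of {series_name: 1-D array of target values}
for name, values in my_dataset.items():
    check_series(name, values)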

Abelarm changed the title from "Training errors during finetuning" to "Training errors during finetuning - tensor errors" on Apr 11, 2024
@gorold
Contributor

gorold commented Apr 11, 2024

Could you provide a minimal reproducible example for this error?

@Abelarm
Author

Abelarm commented Apr 11, 2024

I am digging in right now, and it looks like the error comes from the function _get_loc_scale at:

loc = reduce(
            id_mask * reduce(target * observed_mask, "... seq dim -> ... 1 seq", "sum"),
            "... seq1 seq2 -> ... seq1 1",
            "sum",
        )

None of the inputs to this reduce contain NaNs:

torch.isnan(id_mask).any()
tensor(False, device='mps:0')
torch.isnan(target).any()
tensor(False, device='mps:0')
torch.isnan(observed_mask).any()
tensor(False, device='mps:0')

but loc does:

torch.isnan(loc).any()
tensor(True, device='mps:0')

Actually, it is the inner reduce that produces the NaNs:

torch.isnan(reduce(target * observed_mask, "... seq dim -> ... 1 seq", "sum")).any()
tensor(True, device='mps:0')

Edit: More details...

I discovered that some values of my target are loaded as inf for some strange reason...

target[4,62,15]
tensor(inf, device='mps:0')
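
For reference, a small sketch of how such values can be located once the batch is on the device, plus one possible stop-gap (fixing the data itself is of course the proper solution); target and observed_mask are the tensors named above and are assumed to be a float tensor and a boolean tensor:

import torch

# indices of every non-finite entry in the loaded target tensor
bad_idx = torch.nonzero(~torch.isfinite(target))
print(bad_idx[:10])

# stop-gap: treat non-finite entries as unobserved and zero them out,
# so they cannot leak into the loc/scale sums
observed_mask = observed_mask & torch.isfinite(target)
target = torch.nan_to_num(target, nan=0.0, posinf=0.0, neginf=0.0)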

@Abelarm
Author

Abelarm commented Apr 12, 2024

Sorry for my previous message @gorold, I should have dug a bit more before writing here. 😢
That problem is solved, but now I am stuck with:

distribution Categorical(logits: torch.Size([32, 512, 128, 4])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[[[nan, nan, nan, nan],
          [nan, nan, nan, nan],
          [nan, nan, nan, nan],
          ...,

It appears randomly after some steps (it can even be after one epoch).
My debugging traced the root of the problem to the module MultiInSizeLinear: at some point its self.weight is entirely NaN, so the forward of this module returns NaN...
The strange thing is that sometimes it can actually run for more than 100 steps before breaking.

Do you have an idea?
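
One way to catch this more precisely (a debugging sketch, not something from the repo) is to scan every parameter for non-finite values right after each optimizer step and stop at the first offending module:

import torch

def find_nonfinite_params(model):
    # names of all parameters that contain NaN or inf
    return [name for name, p in model.named_parameters() if not torch.isfinite(p).all()]

# hypothetical usage inside the training loop, right after optimizer.step()
bad = find_nonfinite_params(model)
if bad:
    raise RuntimeError(f"non-finite parameters after this step: {bad}")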

@gorold
Contributor

gorold commented Apr 12, 2024

Hey @Abelarm, this should be caused by a backward pass performing a gradient update with a NaN value. It could perhaps be caused by an inf loss... One way to debug this is to detect any bad values in the training_step and save all inputs and weights.
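
Assuming the Lightning training loop used here, a minimal callback along these lines could dump the batch and the weights the moment anything non-finite shows up (the import path may be lightning.pytorch or pytorch_lightning depending on the installed version, and the output handling is a guess at what training_step returns):

import torch
import lightning.pytorch as pl

class NanDetector(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        loss = outputs.get("loss") if isinstance(outputs, dict) else outputs
        bad_loss = loss is not None and not torch.isfinite(loss).all()
        bad_params = any(not torch.isfinite(p).all() for p in pl_module.parameters())
        if bad_loss or bad_params:
            # save everything needed to inspect the failing step offline
            torch.save(
                {"batch": batch, "state_dict": pl_module.state_dict()},
                f"nan_step_{trainer.global_step}.pt",
            )
            trainer.should_stop = True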

@Abelarm
Author

Abelarm commented Apr 12, 2024

I think we can say it is not due to an inf loss, because I started logging the loss after each step, and the losses of the steps just before the weights become NaN are the following:
[screenshot: logged loss values for the steps preceding the NaN weights]

@Abelarm
Author

Abelarm commented Apr 12, 2024

It is pretty strange. For quick debugging I just printed out the sum of the weights for each feature of the module MultiInSizeLinear:
[screenshot: per-feature weight sums logged at each training step]
At step 53 we get a "normal" sum with a "normal" loss,
but at the very next step everything is NaN 😭

@gorold
Contributor

gorold commented Apr 12, 2024

You could try adding the +trainer.detect_anomaly=True flag; the stack trace might be helpful.
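
For context, that flag turns on PyTorch's autograd anomaly detection for the run, which can also be reproduced in a plain loop; the loss computation here is hypothetical:

import torch

# make autograd raise on the first backward op that produces NaN/inf,
# with a stack trace pointing at the forward op responsible
with torch.autograd.detect_anomaly():
    loss = compute_loss(batch)  # hypothetical loss computation
    loss.backward()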

@Abelarm
Author

Abelarm commented Apr 12, 2024

When running with detect anomaly enabled I got:
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function getKernelAxes, file GPUReductionOps.mm, line 31.
Could this be the reason?

Keep in mind that I am on macOS with ARM, and I needed to change a couple of double calls to long because MPS does not support double.

@gorold
Contributor

gorold commented Apr 12, 2024

I think running it on CPU would result in a more helpful error message, but I can't really help any further without being able to reproduce this error.

@Abelarm
Author

Abelarm commented Apr 15, 2024

Just a heads-up on this:

All the problems come from this: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

I then tried to fix it by changing line 46 of packed_scaler.py from target.double() to target.float(), and everything went sideways.

So I don't know whether you had native macOS support in mind, but for now it looks like it is not possible to train on Apple hardware.

@gorold
Contributor

gorold commented Apr 16, 2024

Thanks for the resolution! You could remove the .double() call locally if you need MPS; it's only required to handle time series with very large values.

I don't think we'll support MPS for now.
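
If someone still wants to experiment on MPS locally, a hedged alternative to deleting the call outright is to make the dtype device-dependent; this is a local workaround, not an official change, and it gives up the float64 headroom on MPS:

# keep float64 where it is supported, fall back to float32 on MPS
target = target.float() if target.device.type == "mps" else target.double()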

gorold closed this as completed on Apr 16, 2024
@fmmoret

fmmoret commented Apr 24, 2024

I have also gotten NaN'd weights when doing pretraining on A100s and a 3090 Ti when using the lotsa or gluonts datasets.
Many of the datasets are very sparse / NaN heavy:
0 0 0 0 0 ... actual data.

The distribution projections handle only matching patch sizes. I think we're sometimes getting unlucky where there are a few good samples for patch size x, a few good ones for patch size y, and then one sample for patch size y made of all 0s, where basically all actual data got masked out.

I haven't narrowed it down yet, but I think something like this might be happening. All 0s leads to 0 variance, and some of the distribution code divides by the variance. I think it gets clamped to the dtype's epsilon, but that might result in huge magnitudes that turn into infs and NaNs somewhere.

They happen pretty rarely, so it's hard to reproduce reliably.
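
That failure mode is easy to demonstrate in isolation: an all-zero patch has zero variance, the clamp to the dtype's epsilon keeps the division legal, but anything divided by the resulting scale blows up by a factor of a few thousand. A toy sketch, not the repo's scaler code:

import torch

patch = torch.zeros(32)                               # every observed value is 0
var = patch.var(unbiased=False)                       # 0.0
scale = var.clamp_min(torch.finfo(patch.dtype).eps).sqrt()
print(scale)                                          # ~3.4e-4 for float32
print((torch.randn(32) / scale).abs().max())          # values amplified ~3000x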

@gorold
Contributor

gorold commented Apr 24, 2024

Hey @fmmoret, thanks for reporting this, we've also seen it occasionally. Another possible cause is the attention layer, if all tokens are masked.

Re-opening this issue to track this pre-training issue.
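
The attention case is also easy to see in isolation: if every key for a query is masked out, the pre-softmax scores are all -inf and the softmax becomes 0/0. A toy illustration, not the model's attention code:

import torch

scores = torch.full((1, 4), float("-inf"))  # every attention score masked out
print(torch.softmax(scores, dim=-1))        # tensor([[nan, nan, nan, nan]])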

gorold reopened this on Apr 24, 2024
gorold added the bug (Something isn't working) label on Apr 24, 2024
gorold changed the title from "Training errors during finetuning - tensor errors" to "Training errors - nans" on Apr 24, 2024
@chenghaoliu89
Contributor

For the case @fmmoret describes above, a quick fix is to filter out the outlier samples by adding one line

batch = [
    sample
    for sample in batch
    # keep a sample only if both the context tokens and the prediction tokens
    # contain at least one observed value within the sample's patch size
    if sample['observed_mask'][(sample['prediction_mask'] == False), :sample['patch_size'][0]].any()
    and sample['observed_mask'][(sample['prediction_mask'] == True), :sample['patch_size'][0]].any()
]

in loader.py at line 107.
