[BUG] preserve grad_fn when replacing non-finite loss in _update_losses_and_lengths #2290

Open
kpal002 wants to merge 3 commits into sktime:main from kpal002:fix/nan-loss-grad-fn

kpal002 commented May 22, 2026

What does this fix?

When a batch produces a NaN or Inf loss (e.g. PoissonLoss on unconstrained NBeats outputs with backcast_loss_ratio > 0), _update_losses_and_lengths replaces it with:

losses = torch.tensor(1e9, device=losses.device)

torch.tensor(...) creates a new leaf tensor with requires_grad=False and no grad_fn. When Lightning subsequently calls loss.backward(), PyTorch raises:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
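
The failure described above can be reproduced in isolation. The following is a minimal sketch, not the library code: the variable `w` and the squared expression are hypothetical stand-ins for model parameters and a diverging loss, chosen only because they overflow float32 while staying attached to the autograd graph.

```python
import torch

# Hypothetical stand-in for a diverging loss that is still attached to the graph.
w = torch.tensor(1e38, requires_grad=True)
losses = (w * w).sum()  # overflows float32 to inf, but has a grad_fn

# The old replacement: a brand-new leaf tensor with no link back to w.
replaced = torch.tensor(1e9, device=losses.device)

has_grad_fn_before = losses.grad_fn is not None
has_grad_fn_after = replaced.grad_fn is not None
requires_grad_after = replaced.requires_grad

# Calling backward on the replacement is what fails.
try:
    replaced.backward()
    backward_error = None
except RuntimeError as err:
    backward_error = str(err)
```

Here `losses` carries a `grad_fn` while `replaced` does not, and `backward_error` holds the message quoted above.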

This is the root cause of the CI failure:

FAILED pytorch_forecasting/tests/test_all_estimators.py::TestAllPtForecasters::test_integration[NBeats-base_params-1-PoissonLoss]

Fix

Replace torch.tensor(1e9) with torch.nan_to_num(losses, nan=1e9, posinf=1e9, neginf=-1e9).

torch.nan_to_num replaces non-finite values while keeping the existing grad_fn connected, so backward() can be called. The gradient of nan_to_num at non-finite inputs is 0 by convention, meaning no parameter update happens for the affected batch — the same intended behaviour as before, but without breaking autograd.

Verified locally:

import torch

x = torch.tensor(1e38, requires_grad=True)
losses = (x * x).sum()  # overflows float32 to inf, still has grad_fn
fixed = torch.nan_to_num(losses, posinf=1e9)
fixed.backward()  # OK; x.grad is 0.0
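
One caveat may be worth checking alongside the snippet above. The zero that nan_to_num passes backward is still multiplied by the local derivatives of the earlier operations, so the final gradient is only 0 when those derivatives are finite. The sketch below illustrates this under stated assumptions: `sanitized_grad` is a hypothetical helper, and the two toy losses are not taken from the library.

```python
import torch

def sanitized_grad(make_loss, value):
    """Return d/dx of nan_to_num(make_loss(x)) for a scalar input value."""
    x = torch.tensor(value, requires_grad=True)
    loss = make_loss(x)
    fixed = torch.nan_to_num(loss, nan=1e9, posinf=1e9, neginf=-1e9)
    fixed.backward()
    return x.grad

# x * x overflows to inf, but its local derivative 2x is finite: 0 * finite = 0.
grad_square = sanitized_grad(lambda x: x * x, 1e38)

# exp(x) overflows to inf and so does its local derivative: 0 * inf = nan.
grad_exp = sanitized_grad(torch.exp, 100.0)
```

In this sketch `grad_square` is 0 while `grad_exp` is NaN, so whether the affected batch really leaves the parameters untouched depends on the operations that produced the non-finite loss, and on the optimizer in use.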

PR checklist

  • Title starts with [BUG]
  • No new dependencies
  • No behaviour change for finite losses
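
The last checklist item can be spot-checked with a small sketch. The tensor `w` and the squared loss are hypothetical; the point is only that nan_to_num passes a finite value and its gradient through unchanged.

```python
import torch

w = torch.tensor([0.5, -1.5], requires_grad=True)
loss = (w ** 2).sum()  # finite: 0.25 + 2.25 = 2.5

# Same call as in the fix; a finite input is returned as-is.
fixed = torch.nan_to_num(loss, nan=1e9, posinf=1e9, neginf=-1e9)
fixed.backward()

finite_value = fixed.item()
finite_grad = w.grad  # equals 2 * w, as it would without nan_to_num
```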

…es_and_lengths

When the summed loss is NaN or Inf (e.g. PoissonLoss on unconstrained NBeats
outputs), the previous code replaced it with torch.tensor(1e9).  That creates
a new leaf tensor with requires_grad=False and no grad_fn, so Lightning's
subsequent loss.backward() call raises:

    RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

This manifests as a CI failure for NBeats + PoissonLoss + backcast_loss_ratio=1.0.

Fix: use torch.nan_to_num instead of torch.tensor, which replaces the
non-finite value while keeping the existing grad_fn connected.  The gradient
of nan_to_num at non-finite inputs is 0 by convention, so no parameter update
happens for the affected batch - the same intended behaviour as before, but
without breaking autograd.

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@859b9ee). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2290   +/-   ##
=======================================
  Coverage        ?   87.13%           
=======================================
  Files           ?      167           
  Lines           ?     9751           
  Branches        ?        0           
=======================================
  Hits            ?     8497           
  Misses          ?     1254           
  Partials        ?        0           
Flag Coverage Δ
cpu 87.13% <100.00%> (?)
pytest 87.13% <100.00%> (?)


phoeenniixx left a comment

Member

Thanks!
I have a question, though:
The PR description says NBeats fails because of this. Can you please show how? I don't see this failure on main.
Is there an issue related to this?

kpal002 commented May 23, 2026

Author

You are correct that NBeats-base_params-1-PoissonLoss passes on main, so this is not a reliably reproducible failure there.

What actually happened: The failure appears to be non-deterministic / numerically flaky. The training logs in the failing CI run showed the loss diverging rapidly (val_loss = -5.53e+3 in epoch 0, then -3.63e+3 in epoch 1). When that happens:

  1. F.poisson_nll_loss produces Inf (NBeats outputs are unconstrained; PoissonLoss does not clip them before the NLL computation)
  2. _update_losses_and_lengths catches the Inf via if not torch.isfinite(losses.sum())
  3. It replaces the loss with torch.tensor(1e9) — a new leaf tensor with requires_grad=False and no grad_fn
  4. That tensor flows back to loss.backward() in Lightning → RuntimeError
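
Steps 1 to 3 above can be sketched outside the library. This is an illustration under assumptions, not the pytorch-forecasting code path: `raw_output` and `target` are hypothetical values, and `F.poisson_nll_loss` is called with its default `log_input=True`, which may differ from how PoissonLoss configures it.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: an unconstrained network output and observed counts.
raw_output = torch.tensor([2.0, 100.0], requires_grad=True)
target = torch.tensor([3.0, 5.0])

# Step 1: with log_input=True the loss is exp(input) - target * input,
# and exp(100.0) overflows float32 to inf.
losses = F.poisson_nll_loss(raw_output, target, reduction="none")

# Step 2: the non-finite guard fires.
guard_fires = not torch.isfinite(losses.sum())

# Step 3: the replacement is a fresh leaf with no link back to raw_output.
replaced = torch.tensor(1e9, device=losses.device)
replacement_detached = replaced.grad_fn is None and not replaced.requires_grad
```

With these values both `guard_fires` and `replacement_detached` are true, which is the state in which step 4 raises.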

The latent bug in step 3 is real and can be verified by code inspection alone — torch.tensor(...) always creates a leaf with no grad. But it is only triggered when a run happens to diverge enough to produce Inf losses, which is more likely for loss functions like PoissonLoss paired with unconstrained models like NBeats.

So the correct characterisation is: this is a latent bug that rarely fires on well-behaved runs (main passes) but can crash any run where the loss goes non-finite.

There is no existing issue for this. I can open one if that would help track it, and I am happy to update the PR description to accurately reflect this rather than claiming the test consistently fails on main.

Sorry for the misleading original description.

kpal002 requested a review from phoeenniixx June 2, 2026 06:51

@phoeenniixx

Member

How sure are we that this IS a problem? I am sorry, but I am not able to reproduce it.
