[BUG] preserve grad_fn when replacing non-finite loss in _update_losses_and_lengths by kpal002 · Pull Request #2290 · sktime/pytorch-forecasting

kpal002 · 2026-05-22T20:10:23Z

What does this fix?

When a batch produces a NaN or Inf loss (e.g. PoissonLoss on unconstrained NBeats outputs with backcast_loss_ratio > 0), _update_losses_and_lengths replaces it with:

losses = torch.tensor(1e9, device=losses.device)

torch.tensor(...) creates a new leaf tensor with requires_grad=False and no grad_fn. When Lightning subsequently calls loss.backward(), PyTorch raises:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
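The error can be reproduced in isolation. The snippet below is a minimal sketch, not the library code: `w` and `loss` are hypothetical stand-ins for a model parameter and a batch loss, and `replaced` mirrors the replacement line quoted above.

```python
import torch

# Hypothetical stand-ins for a model parameter and a batch loss.
w = torch.tensor(2.0, requires_grad=True)
loss = w * 3.0                       # attached to the autograd graph

# What the old branch does once the loss is non-finite:
replaced = torch.tensor(1e9, device=loss.device)

print(loss.grad_fn is not None)      # the original loss can be backpropagated
print(replaced.requires_grad, replaced.grad_fn)

# Calling backward() on the replacement fails because it is a
# leaf tensor that neither requires grad nor has a grad_fn.
try:
    replaced.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print(error_message)
```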

This is the root cause of the CI failure:

FAILED pytorch_forecasting/tests/test_all_estimators.py::TestAllPtForecasters::test_integration[NBeats-base_params-1-PoissonLoss]

Fix

Replace torch.tensor(1e9) with torch.nan_to_num(losses, nan=1e9, posinf=1e9, neginf=-1e9).

torch.nan_to_num replaces non-finite values while keeping the existing grad_fn connected, so backward() can be called. Its backward pass sends zero gradient through the positions where the input was non-finite, so the replaced loss is meant to contribute no gradient for the affected batch. That matches the intent of the previous code, but without breaking autograd.

Verified locally:

import torch

x = torch.tensor(1e38, requires_grad=True)    # float32, so x * x overflows
losses = (x * x).sum()                        # Inf, has grad_fn
fixed = torch.nan_to_num(losses, posinf=1e9)  # 1e9, still attached to the graph
fixed.backward()                              # OK: gradient on x is 0.0

PR checklist

Title starts with [BUG]
No new dependencies
No behaviour change for finite losses
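The last checklist item can be checked directly on torch.nan_to_num. The sketch below assumes the fix passes the loss tensor through nan_to_num with the arguments given in the Fix section; the tensor `finite` is a hypothetical example, not data from the test suite.

```python
import torch

# Hypothetical finite per-sample losses, including one larger than 1e9.
finite = torch.tensor([0.5, 2.0, 1e20], requires_grad=True)
out = torch.nan_to_num(finite, nan=1e9, posinf=1e9, neginf=-1e9)

# Finite values pass through untouched (no clipping at 1e9).
values_unchanged = torch.equal(out, finite)

# The gradient for finite inputs is the identity gradient.
out.sum().backward()
grad_is_one = torch.equal(finite.grad, torch.ones_like(finite))

print(values_unchanged, grad_is_one)
```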

…es_and_lengths

When the summed loss is NaN or Inf (e.g. PoissonLoss on unconstrained NBeats outputs), the previous code replaced it with torch.tensor(1e9). That creates a new leaf tensor with requires_grad=False and no grad_fn, so Lightning's subsequent loss.backward() call raises:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

This manifests as a CI failure for NBeats + PoissonLoss + backcast_loss_ratio=1.0.

Fix: use torch.nan_to_num instead of torch.tensor, which replaces the non-finite value while keeping the existing grad_fn connected. The gradient of nan_to_num at non-finite inputs is 0 by convention, so no parameter update happens for the affected batch. This is the same intended behaviour as before, but without breaking autograd.

codecov · 2026-05-22T21:05:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@859b9ee).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2290   +/-   ##
=======================================
  Coverage        ?   87.13%           
=======================================
  Files           ?      167           
  Lines           ?     9751           
  Branches        ?        0           
=======================================
  Hits            ?     8497           
  Misses          ?     1254           
  Partials        ?        0

Flag	Coverage Δ
cpu	`87.13% <100.00%> (?)`
pytest	`87.13% <100.00%> (?)`


phoeenniixx

Thanks!
I have a question, though:
The PR description says NBeats fails because of this. Can you please verify how? I don't see this failure on main.
Is there an issue related to this?

kpal002 · 2026-05-23T17:45:34Z

You are correct that NBeats-base_params-1-PoissonLoss passes on main, so this is not a reliably reproducible failure there.

What actually happened: The failure appears to be non-deterministic / numerically flaky. The training logs in the failing CI run showed the loss diverging rapidly (val_loss = -5.53e+3 in epoch 0, then -3.63e+3 in epoch 1). When that happens:

1. F.poisson_nll_loss produces Inf (NBeats outputs are unconstrained; PoissonLoss does not clip them before the NLL computation).
2. _update_losses_and_lengths catches the Inf via if not torch.isfinite(losses.sum()).
3. It replaces the loss with torch.tensor(1e9), a new leaf tensor with requires_grad=False and no grad_fn.
4. That tensor flows back to loss.backward() in Lightning → RuntimeError.

The latent bug in step 3 is real and can be verified by code inspection alone: torch.tensor(...) called without requires_grad=True creates a leaf tensor that does not require grad and has no grad_fn. But it is only triggered when a run happens to diverge enough to produce Inf losses, which is more likely for loss functions like PoissonLoss paired with unconstrained models like NBeats.

So the correct characterisation is: this is a latent bug that rarely fires on well-behaved runs (main passes) but can crash any run where the loss goes non-finite.
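The four steps above can be traced outside the test suite. This is a sketch under stated assumptions, not a reproduction of the CI run: `log_rate` is a hypothetical large model output chosen so that exp overflows in float32, and the guard and both replacements are written out by hand rather than called through _update_losses_and_lengths. Whether the gradient that finally reaches the input is 0 or NaN depends on the upstream operations, since 0 * inf is NaN in IEEE 754 arithmetic; the last line prints it so it can be inspected rather than assumed.

```python
import torch
import torch.nn.functional as F

# Hypothetical unconstrained output: a large log-rate, as a diverging run might emit.
log_rate = torch.tensor([100.0], requires_grad=True)
target = torch.tensor([3.0])

# Step 1: exp(100) overflows float32, so the Poisson NLL is Inf.
losses = F.poisson_nll_loss(log_rate, target, log_input=True, reduction="none")

# Step 2: the finiteness guard fires.
is_finite = bool(torch.isfinite(losses.sum()))

# Step 3 (old): a fresh leaf tensor, cut off from the graph.
old = torch.tensor(1e9, device=losses.device)
# Step 3 (proposed): non-finite entries replaced, graph kept.
new = torch.nan_to_num(losses, nan=1e9, posinf=1e9, neginf=-1e9)

# Step 4: backward() runs on the proposed replacement without raising.
new.sum().backward()

print(is_finite, old.grad_fn, new.grad_fn is not None)
print(log_rate.grad)  # inspect: upstream ops such as exp can turn a 0 gradient into NaN
```

If the printed gradient is NaN, the crash is avoided but the affected batch can still feed non-finite gradients to the optimizer, which may be worth checking before relying on the replacement alone.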

There is no existing issue for this. I can open one if that would help track it, and I am happy to update the PR description to accurately reflect this rather than claiming the test consistently fails on main.

Sorry for the misleading original description.

phoeenniixx · 2026-06-02T07:24:34Z

How sure are we that this IS a problem? I am sorry, but I am not able to reproduce it.

kpal002 requested review from PranavBhatP, benHeid, fkiraly, fnhirwa, jdb78, phoeenniixx and yarnabrina as code owners May 22, 2026 20:10

phoeenniixx requested changes May 23, 2026

kpal002 added 2 commits May 26, 2026 08:38

Merge branch 'main' into fix/nan-loss-grad-fn

271694c

Merge branch 'main' into fix/nan-loss-grad-fn

cdc841c

kpal002 requested a review from phoeenniixx June 2, 2026 06:51

kpal002 wants to merge 3 commits into sktime:main from kpal002:fix/nan-loss-grad-fn

3 participants
