ci: Update transformers to latest version 5.11.0#2518

Open
svcnvidia-nemo-ci wants to merge 5 commits into main from transformers_bump_5.11.0

Conversation

@svcnvidia-nemo-ci

Contributor

beep boop 🤖: Updating transformers to the latest version on PyPI

svcnvidia-nemo-ci and others added 5 commits June 9, 2026 12:04
ci: update package version to 0.5.0 (#2472)

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
feat: make mesh accept meshcontext (#2266)

* make mesh accept meshcontext



* fix(transformers): resolve mesh context inputs



* rm



* use moe_overrides



* make create_mesh_context the entry point for dist setup



* fix: add renamed distributed utility files



* fix(vlm): complete dist_setup -> mesh_context rename

Two leftover references to the old setup_distributed/dist_setup API were
missed when the recipe was migrated to create_mesh_context_from_config:

- nemo_automodel/recipes/vlm/finetune.py:794 still read
  self.dist_setup.cp_size, which would AttributeError on any PP+CP VLM run.
- tests/unit_tests/recipes/test_finetune_vlm_cp_wiring.py monkeypatched
  the stale symbol "setup_distributed", causing three parametrizations of
  test_setup_skips_pp_media_prechunk_when_cp_preembeds_vlm_inputs to fail
  during pytest setup with AttributeError.



* remove activation checkpointing from meshcontext



* refac



* dedup



* fix



* fix



* Update nemo_automodel/_diffusers/auto_diffusion_pipeline.py





* Update skills/distributed-training/SKILL.md





* Update skills/distributed-training/SKILL.md





* Update nemo_automodel/components/distributed/config.py





* Update skills/nemo-automodel-distributed-training/SKILL.md



* docs(distributed): update setup API example



* Update nemo_automodel/_diffusers/auto_diffusion_pipeline.py



* fix(diffusers): remove duplicated DistributedSetup.build call in fsdp2 path

Commit 6acf01f left a duplicated `distributed_setup = DistributedSetup.build(`
line and a stray extra `)` in the fsdp2 branch of the parallelism-manager
builder. This broke `ruff format --check` (CI linting job) and, once formatted,
would have nested the call as `DistributedSetup.build(distributed_setup=...)` —
a kwarg the builder does not accept. Restore the single build() call matching
the ddp branch.



* fix(vlm): pass distributed_setup through Gemma4 joint drafter composite

The distributed-config API refactor routes all distributed settings through a
single ``distributed_setup`` argument and rejects the separate ``moe_mesh`` /
``distributed_config`` / ``pipeline_config`` kwargs in
``NeMoAutoModel*.from_pretrained``. ``Gemma4WithDrafter.from_pretrained`` was
still forwarding those separate kwargs to its inner base/drafter loads, so the
joint-drafter VLM finetune (L2_HF_Transformer_VLM_Gemma4_Joint_Drafter) raised
``TypeError: Distributed settings must be passed with distributed_setup``.

Forward ``distributed_setup`` to both sub-loaders instead, and extend the
pipeline/context-parallel safety guards to read pp_size/cp_size from the
resolved setup so the KV-sharing invariant holds on the recipe path too.
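
The kwarg guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the NeMo AutoModel code: the helper name `check_distributed_kwargs` and the `LEGACY_KWARGS` tuple are hypothetical, and only the three kwarg names and the leading error text come from the commit message.

```python
# Hypothetical sketch of the "single distributed_setup argument" contract.
# Only the kwarg names and the start of the error message come from the commit text.
LEGACY_KWARGS = ("moe_mesh", "distributed_config", "pipeline_config")


def check_distributed_kwargs(distributed_setup=None, **kwargs):
    """Reject the separate legacy kwargs and return the single setup object."""
    legacy = sorted(k for k in LEGACY_KWARGS if kwargs.get(k) is not None)
    if legacy:
        raise TypeError(
            "Distributed settings must be passed with distributed_setup "
            f"(got legacy kwargs: {', '.join(legacy)})"
        )
    return distributed_setup


def load_composite(distributed_setup=None):
    """A composite loader forwards the same setup object to both sub-loaders."""
    base_setup = check_distributed_kwargs(distributed_setup=distributed_setup)
    drafter_setup = check_distributed_kwargs(distributed_setup=distributed_setup)
    return base_setup, drafter_setup
```

Under this sketch, a wrapper that still forwards `moe_mesh=...` fails fast with a `TypeError`, which is the failure mode the joint-drafter job hit before the fix.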




* feat: add use_memory_efficient_lora knob (#2239)

* add use_memory_efficient_lora knob



* add use_memory_efficient_lora



* fix



* delete sft gpt-oss 20b single gpu



* add nemotron nano v3 single-gpu lora example



* add grad ckpt



* add fused lora mlp



* fix(checkpoint): support single-GPU Nemotron-H MoE LoRA checkpoint load

Loading a merged-expert Nemotron-H MoE checkpoint through the default DCP / set_model_state_dict path transiently materializes a second on-device copy of the expert weights, which OOMs a 30B-class model on a single 80GB GPU.

Checkpointer.load now routes single-device custom-model safetensors through the frugal full-state path (load to CPU, merge from_hf on CPU, copy into the model), keeping device memory at ~model size.

_load_full_state_dict_into_model normalizes stray real (CPU) buffers left behind by custom-model meta materialization onto the parameter device (avoids 'Multiple devices found'), and uses plain load_state_dict for non-DTensor models so the full state dict is not moved on-device a second time.

Adds a [nemotron-singlegpu-lora] note plus per-site tags documenting these single-device special cases, links the exercising recipe (examples/llm_finetune/nemotron/nemotron_nano_v3_singlegpu_lora.yaml), and flags the load path for a future refactor.



* feat(peft): add fused LoRA SwiGLU/ReLU² MLP with recompute backward

Fuses gate+up+down+activation into a single autograd Function that saves only (x, gate_out, up_out) and recomputes the activation and down-projection input in backward, roughly halving MLP activation memory at equal speed during LoRA SFT.

SwiGLU forward/backward use elementwise Triton kernels (with in-place backward buffer reuse) and a pure-torch fallback when Triton is unavailable; matmuls stay on cuBLAS. Covers SiLU-SwiGLU (gate/up/down) and non-gated ReLU² (e.g. Nemotron-H dense) MLPs.

install_fused_lora_mlp() swaps each LoRA-applied MLP's forward and falls back to the per-linear path at runtime under DTensor (TP/EP), DoRA, or active dropout, keeping it correct under sharding. Already wired from lora.py; opt out via NEMO_AUTOMODEL_DISABLE_FUSED_LORA_MLP=1.

Activation recompute follows Megatron-Core's SwiGLUFunction; the fused LoRA-MLP and in-place buffer reuse follow Unsloth's LoRA_MLP (both Apache-2.0).



* refactor(peft): drop NEMO_AUTOMODEL_DISABLE_FUSED_LORA_MLP env knob

The fused LoRA MLP can already be disabled via the use_memory_efficient_lora
config flag, and fusion auto-falls-back per-MLP under DTensor / DoRA / active
dropout. The env var was a redundant escape hatch; remove it and the now-unused
os import.



* test(checkpoint): align custom-model load-routing guard with single-device fast path

The nemotron-singlegpu-lora change routes single-device (world_size == 1) custom
safetensors models through the frugal full-state fast path instead of DCP. The fast
path now applies the state_dict_adapter from_hf conversion on CPU
(_maybe_adapt_state_dict_from_hf), so custom MoE expert merging still happens — the
guard test's original premise (fast path bypasses conversion) no longer holds.

- Reframe test_custom_model_skips_fast_path_uses_dcp as the multi-rank (sharded)
  case (WORLD_SIZE=2), where DCP per-rank DTensor slicing is genuinely required.
- Add test_single_device_custom_model_uses_fast_path covering the new world_size==1
  behavior (fast path used, DCP not).



---------



* fix(deepseek_v3): initialize weights in fp32 and default router to fp32 (#2450)

* fix(deepseek_v3): init weights in fp32, default router to fp32

Sampling the random init directly in bf16 distorts the variance/mean schedule and
produces exploding first-step gradients (flat/diverging loss) for from-scratch
pretraining. Add an init_weights_in_fp32 context manager that samples in fp32 and
casts back to the resident dtype, and use it in DeepSeek-V3 initialize_weights.
Also default the router (gate_precision) to fp32 to match the HF reference.



* refactor(models): rename init_weights_in_fp32 to yield_fp32_model

Generalize the context manager per review: it's a generic "run this block with
the model in fp32" tool, not init-specific. Yield the model and make the exit
dtype optional (defaults to the model's pre-context float dtype).



---------



* fix(multimodal): migrate finetune recipe to DistributedSetup/MeshContext API

The auto-class-public-api refactor deleted `recipes/_dist_setup.py` and moved recipes
to the `DistributedSetup` / `MeshContext` API, but `multimodal/finetune.py` was left
importing and calling the deleted `_dist_setup.setup_distributed`, so importing the
module raised ModuleNotFoundError and broke the import-check in every Pip/UV install job.

Migrate it to the shared `_distributed_setup_attributes(create_distributed_setup_from_config(...))`
pattern used by the llm/vlm recipes: unpack distributed_setup / mesh_context /
distributed_config / device_mesh / moe_mesh / pp_enabled / pipeline_config /
moe_parallel_config / activation_checkpointing, and update the model-build calls
(`mesh=self.mesh_context`, `moe_config`/`cfg_moe=self.moe_parallel_config`,
`activation_checkpointing=self.activation_checkpointing`).




* feat(speculative): add EAGLE-3 sequence packing and reasoning-mode control (#2444)

* feat(speculative): add reasoning mode control for EAGLE/P-EAGLE/DFlash training

Add --reasoning {none,save,disable} flag to regenerate.py for controlling
whether target model reasoning content is preserved or suppressed during
data regeneration. Add mask_reasoning_content option to EAGLE/P-EAGLE/DFlash
training recipes to exclude reasoning traces from the loss mask.





* feat(speculative): add EAGLE-3 sequence packing for draft training

Pack variable-length chat samples into fixed-width rows for EAGLE-3
training, removing the per-sample padding waste of the default
max_length path. Documents within a row attend block-causally: the
target uses a 4D block-causal mask (SDPA) and the draft uses varlen
FlashAttention-2; cross-document TTT supervision is gated by
doc_remaining so deeper steps never leak across boundaries. Opt-in via
packed_sequence_size > 0, colocated target backend only. Covered by
unit tests plus an FA2-vs-eager parity test.
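
To make the packing and block-causal attention idea concrete, here is a small pure-Python sketch. It assumes a first-fit packing strategy and a boolean mask layout; both are illustrative choices, the function names are hypothetical, and the recipe's actual packing policy and mask construction may differ.

```python
def pack_first_fit(lengths, row_width):
    """Pack document lengths into rows holding at most row_width tokens each."""
    rows = []
    for n in lengths:
        if n > row_width:
            raise ValueError(f"document of length {n} exceeds row width {row_width}")
        for row in rows:
            if sum(row) + n <= row_width:
                row.append(n)
                break
        else:
            rows.append([n])  # no existing row had room; open a new one
    return rows


def block_causal_mask(doc_lengths, row_width):
    """mask[q][k] is True when query q may attend to key k.

    Attention is allowed only inside the same document and only to positions
    at or before the query. Padding positions are fully masked in this sketch;
    a real attention kernel needs to handle such rows explicitly.
    """
    doc_id = []
    for i, n in enumerate(doc_lengths):
        doc_id.extend([i] * n)
    doc_id.extend([-1] * (row_width - len(doc_id)))  # -1 marks padding
    return [
        [doc_id[q] >= 0 and doc_id[q] == doc_id[k] and k <= q for k in range(row_width)]
        for q in range(row_width)
    ]
```

For example, `pack_first_fit([3, 2, 4, 1], 5)` fills two rows completely, whereas padding each sample to width 5 would use four rows.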





---------






* feat(distributed): add selective activation checkpointing for FSDP2 (#2389)

* feat(distributed): add selective activation checkpointing for FSDP2



* fix(distributed): support selective activation checkpointing with torch.compile



* docs(fern): drop selective AC from frozen v0.4 snapshot



* feat(distributed): honor selective activation checkpointing on single GPU



* feat(moe): support selective activation checkpointing with expert parallelism



* fix(model): make DeepSeek MLP dispatch wrapper-safe



* fix(distributed): save expert grouped-GEMM in selective AC and add op trace



* feat(moe): compile selective activation checkpointing wrappers outer



* refactor(distributed): move selective AC into its own module

Extract the TorchTitan-style selective activation checkpointing core out of
the central parallelizer.py into a dedicated activation_checkpointing.py:
op-set construction, the save/recompute policy, block/sub-module wrappers,
KV-sharing detection, and the compile-outer wrapper flag. parallelizer.py
keeps only the thin apply_selective_activation_checkpointing entry point,
which still needs the heavy, transformers-aware _extract_model_layers, so the
dependency stays one-directional (parallelizer -> activation_checkpointing ->
parallelizer_utils) with no circular imports.

Move the opt-in NEMO_SELECTIVE_AC_TRACE diagnostic out of parallelizer.py into
parallelizer_utils.maybe_trace_selective_ac_decision so the hot policy is a
single call site instead of trace globals plus a helper.

Make the new module's cross-module interface public (drop the leading
underscore) and keep internal op-resolution/plumbing private. Update the moe
and fsdp2 consumers and the unit tests to import from the new module.

Also fix doc wording: clarify that torch.compile must be held fixed when
comparing full vs. selective, and refer to TorchTitan as a reference
implementation rather than "upstream".



* refactor(distributed): move selective-AC trace into the AC module



* test(distributed): patch activation_checkpointing.checkpoint_wrapper after AC module split



* docs: apply tech-writer edits to gradient-checkpointing guide



---------



* feat(diffusion): improve qwen image finetuning configs (#2442)




* ci: add nemo-run, split qwen-vl-utils from decord for arm (#2456)

* ci: add nemo-run, split qwen-vl-utils from decord for arm




* fix: override in pytorch container




* Update uv lock



---------






* Apply suggestions from code review




* fix(precision): dtype contract bug fixes for FSDP2 mixed-dtype loads (#2419)

* fix(transformers): unify loaded HF dtype via promote_types

Make _restore_loaded_model_dtype dtype-aware: instead of always restoring to
the checkpoint dtype, unify each floating tensor to promote_types(checkpoint,
requested). This honors an explicit fp32 request while preserving
intrinsically-fp32 checkpoint params (e.g. A_log) under a bf16 request, and is
a no-op for the bf16/auto path. Fixes FSDP2 uniform-dtype tripping on
HF mixed-dtype loads.
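
The three cases named above (explicit fp32 request, intrinsically-fp32 params under a bf16 request, and the bf16 no-op) can be illustrated with a small sketch. It uses string dtype labels and a hand-written `promote` helper as a stand-in for `torch.promote_types`, restricted to the float dtypes discussed here; `unify_dtypes` is a hypothetical name, not the real `_restore_loaded_model_dtype`.

```python
def promote(a, b):
    """Stand-in for torch.promote_types over {"bf16", "fp16", "fp32"} only."""
    allowed = {"bf16", "fp16", "fp32"}
    if a not in allowed or b not in allowed:
        raise ValueError(f"unsupported dtype pair: {a}, {b}")
    # Equal dtypes are unchanged. Any differing pair in this set promotes to
    # fp32: either one side already is fp32, or the pair is fp16/bf16, which
    # has no common 16-bit type.
    return a if a == b else "fp32"


def unify_dtypes(checkpoint_dtypes, requested):
    """Map each tensor name to promote(checkpoint dtype, requested dtype)."""
    return {name: promote(dt, requested) for name, dt in checkpoint_dtypes.items()}


ckpt = {"mlp.weight": "bf16", "mixer.A_log": "fp32"}
under_bf16 = unify_dtypes(ckpt, "bf16")  # A_log keeps fp32, the rest stays bf16
under_fp32 = unify_dtypes(ckpt, "fp32")  # explicit fp32 request is honored everywhere
```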



* feat(distributed): default pipeline dtype to FSDP activation dtype

When pipeline parallelism is enabled and pipeline.dtype is unset, derive it from
the FSDP mixed-precision activation dtype (mp_policy.output_dtype, falling back to
param_dtype) so pipeline stage shape inference matches the real activation dtype
(e.g. bf16 compute under fp32 master weights). An explicitly set pipeline.dtype is
honored but warned on mismatch, since it can corrupt inter-stage recv buffers.
No-ops for strategies without an mp_policy (e.g. MegatronFSDP) and for pp_size==1.


(cherry picked from commit 3f6b246)
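
The defaulting rule in this commit reads as a small decision function. The sketch below is an assumption-laden illustration: `MPPolicy` is a hypothetical stand-in for the FSDP mixed-precision policy, dtypes are plain strings, and `resolve_pipeline_dtype` is not the name used in the repository.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class MPPolicy:
    """Hypothetical stand-in for an FSDP mixed-precision policy."""

    param_dtype: Optional[str] = None
    output_dtype: Optional[str] = None


def resolve_pipeline_dtype(pipeline_dtype, mp_policy, pp_size):
    """Return the dtype used for pipeline stage shape inference."""
    # No-op without pipeline parallelism or without an mp_policy.
    if pp_size == 1 or mp_policy is None:
        return pipeline_dtype
    # Activation dtype: output_dtype, falling back to param_dtype.
    activation = mp_policy.output_dtype or mp_policy.param_dtype
    if pipeline_dtype is None:
        return activation
    if activation is not None and pipeline_dtype != activation:
        # An explicit value is honored, but a mismatch is worth a warning.
        warnings.warn(
            f"pipeline.dtype={pipeline_dtype} differs from FSDP activation dtype {activation}"
        )
    return pipeline_dtype
```

With fp32 master weights and bf16 compute, `resolve_pipeline_dtype(None, MPPolicy(param_dtype="bf16"), pp_size=2)` yields `"bf16"`, so the inferred stage shapes match the real activations.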

* refactor(distributed): resolve FSDP compute dtype per-param, decoupled from storage

fully_shard_by_dtype now groups parameters by their required *compute* dtype
instead of their storage dtype, so fp32 master weights (uniform fp32 storage)
still compute the bulk in mp_policy.param_dtype (bf16) while intrinsically-fp32
params keep fp32 compute.

Per-parameter compute dtype is resolved by precedence: pinned fp32
(_keep_in_fp32_modules_strict) > HF-recorded checkpoint dtype (tagged onto each
tensor at load time in _restore_loaded_model_dtype) > mp_policy.param_dtype.
Qwen3.5's GatedDeltaNet fp32 holder is declared via patch_hf_model; the
NemotronH and Qwen3.5 strategies thread the declaration through.


(cherry picked from commit 3dd6b97)
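
The precedence chain above (pinned fp32 > HF-recorded checkpoint dtype > `mp_policy.param_dtype`) can be sketched as below. The function names, the string dtype labels, and the way a pinned name is matched against dotted parameter paths are all illustrative assumptions rather than the behavior of `fully_shard_by_dtype`.

```python
def resolve_compute_dtype(name, pinned_fp32, recorded_dtypes, param_dtype):
    """Pick the compute dtype for one parameter by precedence."""
    # 1. Pinned fp32 wins. Matching on a dotted path component is an assumption.
    if any(p in name.split(".") for p in pinned_fp32):
        return "fp32"
    # 2. Otherwise use the dtype recorded from the checkpoint at load time, if any.
    if name in recorded_dtypes:
        return recorded_dtypes[name]
    # 3. Otherwise fall back to the mixed-precision policy's param dtype.
    return param_dtype


def group_by_compute_dtype(names, pinned_fp32, recorded_dtypes, param_dtype):
    """Group parameter names by resolved compute dtype, not by storage dtype."""
    groups = {}
    for name in names:
        dtype = resolve_compute_dtype(name, pinned_fp32, recorded_dtypes, param_dtype)
        groups.setdefault(dtype, []).append(name)
    return groups


groups = group_by_compute_dtype(
    ["layers.0.mlp.weight", "layers.0.mixer.A_log", "layers.0.norm.scale"],
    pinned_fp32={"A_log"},
    recorded_dtypes={"layers.0.norm.scale": "fp32"},
    param_dtype="bf16",
)
```

Grouping by compute dtype is what lets uniform fp32 storage (master weights) still run the bulk of the model in bf16 while the pinned parameters stay in fp32.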

* docs(model-onboarding): document _keep_in_fp32_modules_strict contract

Add SKILL.md §2.6 explaining which params must compute in fp32 (SSM A_log/
dt_bias/D, MoE sigmoid-gate bias, attention-sink bias, scale), how to declare
them (class attribute vs patch_hf_model instance attribute), and why the pin is
the robust signal across all load paths. Broaden the MoE checklist item and
code comment accordingly.


(cherry picked from commit a11db38)

* test(distributed): add fp32 compute-dtype contract test

Assert the resident compute dtype of every trainable parameter across the model
archetypes that use fully_shard_by_dtype (dense, Qwen3.5-style hybrid), covering
the full precedence chain: pinned fp32 > HF-recorded dtype > mp_policy.param_dtype,
under fp32 master weights and ordinary loads.


(cherry picked from commit dc83926)

* feat(model): cast frozen modules to compute dtype to avoid mismatch


(cherry picked from commit d321f5e)

* refactor(gemma4): drop projector dtype hook now general frozen cast handles it


(cherry picked from commit 1bc67e2)

* feat(training): add dormant resolve_storage_dtype helper

Add resolve_storage_dtype() (and its unit tests) for defaulting model.torch_dtype
to fp32 for full-parameter torch.optim training. Not yet wired into recipes here;
the call sites are marked with breadcrumb comments and enabled in a follow-up PR,
keeping this PR limited to dtype bug fixes with no behavior/memory change.



* fix(model): cast frozen-module buffers and unsharded params to compute dtype



* docs(infra): correct frozen-tower FSDP comment to match sharding reality



* docs(mixed-precision): clarify TE vs torch AdamW memory and precision trade-offs



* docs(mixed-precision): apply tech writer edits



* docs(mixed-precision): drop unresolvable FSDP anchor



---------



* docs(speculative): add subsystem README, fold in regeneration guide (#2448)

Add examples/speculative/README.md covering the whole speculative-decoding
draft-training subsystem: supported methods (EAGLE-1/2/3/3.1, P-EAGLE,
DFlash), target-model registry coverage, compute backends (eager vs
flash_attention_2, flex_attention/sdpa, fused Triton soft cross-entropy,
d2t/t2d draft-vocab compression), target backends (co-located, remote,
offline cache), serving and benchmarking, inference-engine compatibility,
and a consolidated config reference.

Fold the standalone regenerate_with_target.md into the README's data
preparation section (full two-step flow, tuning table, pitfalls) and remove
the separate file so there is a single entry point.



* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support (#2284)

* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support



* fix the memory management for training large 14B wan model

* fix wan2.2 support

* all good for wan2.2

* update



* docs(fern): add Wan2.2 T2V-A14B model coverage and release log entry



* fix another round of code review

* fix(diffusion): sort wan.py imports to satisfy CI isort (I001)



* fix(diffusion): load inference checkpoints to CPU to halve peak GPU memory

Avoids doubling peak GPU memory (and a potential OOM in Wan2.2 two-stage
inference) by loading EMA/consolidated state dicts with map_location="cpu";
load_state_dict copies into the already-on-device parameters.



---------





* test: include find_unused_parameters in ddp manager args expectation

The DDP strategy config exposes find_unused_parameters (default False),
so _build_diffusion_parallel_manager_args returns it in the ddp branch.
Update the test's expected dict to match, fixing the L0 unit test failure.




* fix(distributed): address Claude review comments

- infrastructure.py: forward the model wrapper's mp_policy (from FSDP2Config)
  to the MoE expert parallelizer when MoEParallelizerConfig.mp_policy is unset,
  so a custom precision policy isn't silently dropped for EP models.
- skills/nemo-automodel-distributed-training/SKILL.md: fix stale references —
  MeshContext no longer holds strategy_config/pipeline_config/moe_config and
  STRATEGY_MAP moved to _STRATEGY_MAP in config.py; MoEParallelizerConfig now
  lives in components/distributed/config.py.




---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Signed-off-by: Adil Asif <adasif@nvidia.com>
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: thyways <2484113689@qq.com>
Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: linnan wang <linnanw@nvidia.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Yuhe Zhang <yuhezhang.zju@gmail.com>
Co-authored-by: khazzz1c <khazzz1c@gmail.com>
Co-authored-by: thyways <2484113689@qq.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Pranav Thombre <pthombre@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Co-authored-by: linnan wang <linnanw@nvidia.com>
feat(vlm): enable Qwen3.5 MoE VLM CP (#2432)

* feat(vlm): enable Qwen3.5 MoE VLM CP



* test(vlm): cover Qwen3.5 MoE VLM CP changes

Add unit coverage for the new/changed code paths in PR #2432:
- cp_utils: opt-in seq_index CP buffer, singleton expansion, arange-continued padding
- Qwen3_5MoeBlock.forward: seq_index threading into linear_attn, stripping on full-attn path
- prepare_model_inputs_for_cp / _pre_embed_only dispatch and text-only forward path
- PreTokenizedDatasetWrapper inject_fake_images gating + build_dataloader passthrough
- _run_validation_epoch: total_tokens not summed over CP ranks




* style(vlm): sort imports in qwen3_5_moe model.py

Fixes ruff I001 (unsorted import block) flagged by CI linting:
`import inspect` was added above `import copy`.




* refactor(qwen): keep CP seq index out of cp utils



* rename qwen medpix cp2 config



* test(qwen): align CP seq-index tests with cp_linear_attn refactor

The "keep CP seq index out of cp utils" refactor moved seq_index handling
out of make_cp_batch_and_ctx and prepare_model_inputs_for_cp into
CPAwareGatedDeltaNet. Update tests accordingly:
- drop obsolete seq_index buffer/padding tests from test_cp_utils
- prepare_model_inputs_for_cp now returns only inputs_embeds + position_ids
- rewrite TestExtractLocalPositions -> TestExtractLocalSeqIndex for the new
  _extract_local_seq_index signature
- add coverage for _build_dual_chunk_local_positions (DualChunkSwap layout)




---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(model): flux2 (#2145)

* flux2 init draft

* update

* fix(diffusion): revert flux1 example and add flux2 inference config

- Revert accidental FLUX.2 changes in flux_t2i_flow.yaml back to FLUX.1-dev
- Add examples/diffusion/generate/configs/generate_flux2.yaml for FLUX.2-dev inference




* fix(diffusion): fix flux2 contiguity and text encoder eval

- Add .contiguous() after permute in _pack_latents and _unpack_latents
  so hidden_states is always contiguous before flash-attention kernel
- Call pipeline.text_encoder.eval() after device placement, consistent
  with FluxProcessor, WanProcessor, and QwenImageProcessor




* feat(diffusion): sync flux2 configs with main performance fields

Add optimizer flags (foreach/fused), performance block, FSDP2 prefetch
tuning, save_checkpoint_every_epoch, and save_consolidated=final to
flux2_t2i_flow.yaml and flux2_t2i_flow_lora.yaml to match the fields
added to flux_t2i_flow.yaml in main.




* fix(diffusion): fix flux2 cfg dropout to apply per-sample not per-batch

Replace single random.random() gate (correlated across entire batch) with
a per-sample Bernoulli mask so each sample independently has
cfg_dropout_prob chance of receiving zeroed text embeddings. Also drop
the now-unused `import random`.
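
The difference between the two gating schemes can be shown with a minimal pure-Python sketch. It uses lists in place of tensors and a seeded `random.Random` purely for illustration; the helper names are hypothetical and the adapter itself presumably builds the mask with tensor operations.

```python
import random


def per_batch_gate(batch_size, p, rng):
    """Old behavior: one draw decides for the whole batch (fully correlated)."""
    drop = rng.random() < p
    return [drop] * batch_size


def per_sample_mask(batch_size, p, rng):
    """New behavior: an independent Bernoulli(p) draw for each sample."""
    return [rng.random() < p for _ in range(batch_size)]


def apply_cfg_dropout(text_embeddings, mask):
    """Zero the text embedding of every sample whose mask entry is True."""
    return [
        [0.0] * len(emb) if drop else list(emb)
        for emb, drop in zip(text_embeddings, mask)
    ]


rng = random.Random(0)
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
dropped = apply_cfg_dropout(batch, per_sample_mask(len(batch), 0.1, rng))
```

With the per-batch gate, every batch is either fully conditioned or fully unconditioned; the per-sample mask mixes both kinds within a batch while keeping the same expected drop rate.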




* fix(diffusion): fix flux cfg dropout to apply per-sample not per-batch

Replace single random.random() gate (correlated across entire batch) with
a per-sample Bernoulli mask so each sample independently has
cfg_dropout_prob chance of receiving zeroed text/pooled embeddings.
Also drop the now-unused `import random`.




* test(diffusion): add unit tests for Flux2Adapter and Flux2Processor

- tests/unit_tests/flow_matching/test_flux2_adapter.py: 36 tests covering
  pack/unpack roundtrip + contiguity, 4D positional IDs (img_ids/txt_ids)
  shape/dtype/value correctness, prepare_inputs keys/shapes/normalization/
  CFG dropout, and forward model call kwargs
- tests/unit_tests/diffusion_processors/test_flux2_processor.py: 22 tests
  covering model_type/default_model_name properties, encode_image BN
  normalization + dtype + squeeze, encode_text Mistral3 args + no-clip keys,
  verify_latent shape/NaN/Inf checks, get_cache_data structure, and
  ProcessorRegistry lookup




---------

Signed-off-by: linnan wang <linnanw@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: linnan wang <linnanw@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: gitlab-runner <gitlab-runner@nvidia.com>
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested review from a team and jgerh as code owners June 11, 2026 08:20
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
