ci: Update transformers to latest version 5.11.0 #2518
Open
svcnvidia-nemo-ci wants to merge 5 commits into
feat: make mesh accept meshcontext (#2266)

* make mesh accept meshcontext
* fix(transformers): resolve mesh context inputs
* rm
* use moe_overrides
* make create_mesh_context the entry point for dist setup
* fix: add renamed distributed utility files
* fix(vlm): complete dist_setup -> mesh_context rename

Two leftover references to the old setup_distributed/dist_setup API were missed when the recipe was migrated to create_mesh_context_from_config:

- nemo_automodel/recipes/vlm/finetune.py:794 still read self.dist_setup.cp_size, which would AttributeError on any PP+CP VLM run.
- tests/unit_tests/recipes/test_finetune_vlm_cp_wiring.py monkeypatched the stale symbol "setup_distributed", causing three parametrizations of test_setup_skips_pp_media_prechunk_when_cp_preembeds_vlm_inputs to fail during pytest setup with AttributeError.

* remove activation checkpointing from meshcontext
* refac
* dedup
* fix
* fix
* Update nemo_automodel/_diffusers/auto_diffusion_pipeline.py
* Update skills/distributed-training/SKILL.md
* Update skills/distributed-training/SKILL.md
* Update nemo_automodel/components/distributed/config.py
* Update skills/nemo-automodel-distributed-training/SKILL.md
* docs(distributed): update setup API example
* Update nemo_automodel/_diffusers/auto_diffusion_pipeline.py
* fix(diffusers): remove duplicated DistributedSetup.build call in fsdp2 path

Commit 6acf01f left a duplicated `distributed_setup = DistributedSetup.build(` line and a stray extra `)` in the fsdp2 branch of the parallelism-manager builder. This broke `ruff format --check` (CI linting job) and, once formatted, would have nested the call as `DistributedSetup.build(distributed_setup=...)`, a kwarg the builder does not accept. Restore the single build() call matching the ddp branch.
* fix(vlm): pass distributed_setup through Gemma4 joint drafter composite

The distributed-config API refactor routes all distributed settings through a single ``distributed_setup`` argument and rejects the separate ``moe_mesh`` / ``distributed_config`` / ``pipeline_config`` kwargs in ``NeMoAutoModel*.from_pretrained``. ``Gemma4WithDrafter.from_pretrained`` was still forwarding those separate kwargs to its inner base/drafter loads, so the joint-drafter VLM finetune (L2_HF_Transformer_VLM_Gemma4_Joint_Drafter) raised ``TypeError: Distributed settings must be passed with distributed_setup``. Forward ``distributed_setup`` to both sub-loaders instead, and extend the pipeline/context-parallel safety guards to read pp_size/cp_size from the resolved setup so the KV-sharing invariant holds on the recipe path too.

* feat: add use_memory_efficient_lora knob (#2239)
* add use_memory_efficient_lora knob
* add use_memory_efficient_lora
* fix
* delete sft gpt-oss 20b single gpu
* add nemotron nano v3 single-gpu lora example
* add grad ckpt
* add fused lora mlp
* fix(checkpoint): support single-GPU Nemotron-H MoE LoRA checkpoint load

Loading a merged-expert Nemotron-H MoE checkpoint through the default DCP / set_model_state_dict path transiently materializes a second on-device copy of the expert weights, which OOMs a 30B-class model on a single 80GB GPU. Checkpointer.load now routes single-device custom-model safetensors through the frugal full-state path (load to CPU, merge from_hf on CPU, copy into the model), keeping device memory at ~model size. _load_full_state_dict_into_model normalizes stray real (CPU) buffers left behind by custom-model meta materialization onto the parameter device (avoids 'Multiple devices found'), and uses plain load_state_dict for non-DTensor models so the full state dict is not moved on-device a second time.

Adds a [nemotron-singlegpu-lora] note plus per-site tags documenting these single-device special cases, links the exercising recipe (examples/llm_finetune/nemotron/nemotron_nano_v3_singlegpu_lora.yaml), and flags the load path for a future refactor.

* feat(peft): add fused LoRA SwiGLU/ReLU² MLP with recompute backward

Fuses gate+up+down+activation into a single autograd Function that saves only (x, gate_out, up_out) and recomputes the activation and down-projection input in backward, roughly halving MLP activation memory at equal speed during LoRA SFT. SwiGLU forward/backward use elementwise Triton kernels (with in-place backward buffer reuse) and a pure-torch fallback when Triton is unavailable; matmuls stay on cuBLAS. Covers SiLU-SwiGLU (gate/up/down) and non-gated ReLU² (e.g. Nemotron-H dense) MLPs. install_fused_lora_mlp() swaps each LoRA-applied MLP's forward and falls back to the per-linear path at runtime under DTensor (TP/EP), DoRA, or active dropout, keeping it correct under sharding. Already wired from lora.py; opt out via NEMO_AUTOMODEL_DISABLE_FUSED_LORA_MLP=1. Activation recompute follows Megatron-Core's SwiGLUFunction; the fused LoRA-MLP and in-place buffer reuse follow Unsloth's LoRA_MLP (both Apache-2.0).

* refactor(peft): drop NEMO_AUTOMODEL_DISABLE_FUSED_LORA_MLP env knob

The fused LoRA MLP can already be disabled via the use_memory_efficient_lora config flag, and fusion auto-falls-back per-MLP under DTensor / DoRA / active dropout. The env var was a redundant escape hatch; remove it and the now-unused os import.

* test(checkpoint): align custom-model load-routing guard with single-device fast path

The nemotron-singlegpu-lora change routes single-device (world_size == 1) custom safetensors models through the frugal full-state fast path instead of DCP. The fast path now applies the state_dict_adapter from_hf conversion on CPU (_maybe_adapt_state_dict_from_hf), so custom MoE expert merging still happens; the guard test's original premise (fast path bypasses conversion) no longer holds.

- Reframe test_custom_model_skips_fast_path_uses_dcp as the multi-rank (sharded) case (WORLD_SIZE=2), where DCP per-rank DTensor slicing is genuinely required.
- Add test_single_device_custom_model_uses_fast_path covering the new world_size==1 behavior (fast path used, DCP not).

---------

* fix(deepseek_v3): initialize weights in fp32 and default router to fp32 (#2450)
* fix(deepseek_v3): init weights in fp32, default router to fp32

Sampling the random init directly in bf16 distorts the variance/mean schedule and produces exploding first-step gradients (flat/diverging loss) for from-scratch pretraining. Add an init_weights_in_fp32 context manager that samples in fp32 and casts back to the resident dtype, and use it in DeepSeek-V3 initialize_weights. Also default the router (gate_precision) to fp32 to match the HF reference.

* refactor(models): rename init_weights_in_fp32 to yield_fp32_model

Generalize the context manager per review: it's a generic "run this block with the model in fp32" tool, not init-specific. Yield the model and make the exit dtype optional (defaults to the model's pre-context float dtype).

---------

* fix(multimodal): migrate finetune recipe to DistributedSetup/MeshContext API

The auto-class-public-api refactor deleted `recipes/_dist_setup.py` and moved recipes to the `DistributedSetup` / `MeshContext` API, but `multimodal/finetune.py` was left importing and calling the deleted `_dist_setup.setup_distributed`, so importing the module raised ModuleNotFoundError and broke the import-check in every Pip/UV install job.

Migrate it to the shared `_distributed_setup_attributes(create_distributed_setup_from_config(...))` pattern used by the llm/vlm recipes: unpack distributed_setup / mesh_context / distributed_config / device_mesh / moe_mesh / pp_enabled / pipeline_config / moe_parallel_config / activation_checkpointing, and update the model-build calls (`mesh=self.mesh_context`, `moe_config`/`cfg_moe=self.moe_parallel_config`, `activation_checkpointing=self.activation_checkpointing`).

* feat(speculative): add EAGLE-3 sequence packing and reasoning-mode control (#2444)
* feat(speculative): add reasoning mode control for EAGLE/P-EAGLE/DFlash training

Add --reasoning {none,save,disable} flag to regenerate.py for controlling whether target model reasoning content is preserved or suppressed during data regeneration. Add mask_reasoning_content option to EAGLE/P-EAGLE/DFlash training recipes to exclude reasoning traces from the loss mask.

* feat(speculative): add EAGLE-3 sequence packing for draft training

Pack variable-length chat samples into fixed-width rows for EAGLE-3 training, removing the per-sample padding waste of the default max_length path. Documents within a row attend block-causally: the target uses a 4D block-causal mask (SDPA) and the draft uses varlen FlashAttention-2; cross-document TTT supervision is gated by doc_remaining so deeper steps never leak across boundaries. Opt-in via packed_sequence_size > 0, colocated target backend only. Covered by unit tests plus an FA2-vs-eager parity test.
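
The block-causal attention pattern described for packed EAGLE-3 rows can be illustrated with a minimal sketch in plain Python. The function names `block_causal_mask` and `doc_remaining_per_position` are hypothetical and are not taken from the PR. The definition of "remaining" used here (tokens left in the same document after each position) is an assumption about what the commit's `doc_remaining` gate tracks.

```python
def block_causal_mask(doc_lengths, row_width):
    """Boolean mask for one packed row (hypothetical helper, not the PR's code).

    mask[i][j] is True when query position i may attend to key position j,
    i.e. both positions belong to the same document and j <= i.
    Padding positions (past the last document) attend to nothing.
    """
    if sum(doc_lengths) > row_width:
        raise ValueError("documents do not fit in the packed row")
    # doc_id[p] is the index of the document that owns position p, -1 for padding.
    doc_id = [-1] * row_width
    start = 0
    for idx, length in enumerate(doc_lengths):
        for p in range(start, start + length):
            doc_id[p] = idx
        start += length
    return [
        [doc_id[i] != -1 and doc_id[i] == doc_id[j] and j <= i for j in range(row_width)]
        for i in range(row_width)
    ]


def doc_remaining_per_position(doc_lengths, row_width):
    """Tokens left in the same document after each position (assumed definition)."""
    if sum(doc_lengths) > row_width:
        raise ValueError("documents do not fit in the packed row")
    remaining = []
    for length in doc_lengths:
        remaining.extend(range(length - 1, -1, -1))
    remaining.extend([0] * (row_width - len(remaining)))  # padding has nothing left
    return remaining


mask = block_causal_mask([3, 2], row_width=6)
remaining = doc_remaining_per_position([3, 2], row_width=6)
```

Under this assumed definition, a supervision step that looks k tokens ahead would only be kept at positions where the remaining count is at least k, which is one way to keep deeper steps from crossing a document boundary.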
---------

* feat(distributed): add selective activation checkpointing for FSDP2 (#2389)
* feat(distributed): add selective activation checkpointing for FSDP2
* fix(distributed): support selective activation checkpointing with torch.compile
* docs(fern): drop selective AC from frozen v0.4 snapshot
* feat(distributed): honor selective activation checkpointing on single GPU
* feat(moe): support selective activation checkpointing with expert parallelism
* fix(model): make DeepSeek MLP dispatch wrapper-safe
* fix(distributed): save expert grouped-GEMM in selective AC and add op trace
* feat(moe): compile selective activation checkpointing wrappers outer
* refactor(distributed): move selective AC into its own module

Extract the TorchTitan-style selective activation checkpointing core out of the central parallelizer.py into a dedicated activation_checkpointing.py: op-set construction, the save/recompute policy, block/sub-module wrappers, KV-sharing detection, and the compile-outer wrapper flag. parallelizer.py keeps only the thin apply_selective_activation_checkpointing entry point, which still needs the heavy, transformers-aware _extract_model_layers, so the dependency stays one-directional (parallelizer -> activation_checkpointing -> parallelizer_utils) with no circular imports.

Move the opt-in NEMO_SELECTIVE_AC_TRACE diagnostic out of parallelizer.py into parallelizer_utils.maybe_trace_selective_ac_decision so the hot policy is a single call site instead of trace globals plus a helper.

Make the new module's cross-module interface public (drop the leading underscore) and keep internal op-resolution/plumbing private. Update the moe and fsdp2 consumers and the unit tests to import from the new module.

Also fix doc wording: clarify that torch.compile must be held fixed when comparing full vs. selective, and refer to TorchTitan as a reference implementation rather than "upstream".

* refactor(distributed): move selective-AC trace into the AC module
* test(distributed): patch activation_checkpointing.checkpoint_wrapper after AC module split
* docs: apply tech-writer edits to gradient-checkpointing guide

---------

* feat(diffusion): improve qwen image finetuning configs (#2442)
* ci: add nemo-run, split qwen-vl-utils from decord for arm (#2456)
* ci: add nemo-run, split qwen-vl-utils from decord for arm
* fix: override in pytorch container
* Update uv lock

---------

* Apply suggestions from code review
* fix(precision): dtype contract bug fixes for FSDP2 mixed-dtype loads (#2419)
* fix(transformers): unify loaded HF dtype via promote_types

Make _restore_loaded_model_dtype dtype-aware: instead of always restoring to the checkpoint dtype, unify each floating tensor to promote_types(checkpoint, requested). This honors an explicit fp32 request while preserving intrinsically-fp32 checkpoint params (e.g. A_log) under a bf16 request, and is a no-op for the bf16/auto path. Fixes FSDP2 uniform-dtype tripping on HF mixed-dtype loads.

* feat(distributed): default pipeline dtype to FSDP activation dtype

When pipeline parallelism is enabled and pipeline.dtype is unset, derive it from the FSDP mixed-precision activation dtype (mp_policy.output_dtype, falling back to param_dtype) so pipeline stage shape inference matches the real activation dtype (e.g. bf16 compute under fp32 master weights). An explicitly set pipeline.dtype is honored but warned on mismatch, since it can corrupt inter-stage recv buffers. No-ops for strategies without an mp_policy (e.g. MegatronFSDP) and for pp_size==1.

(cherry picked from commit 3f6b246)

* refactor(distributed): resolve FSDP compute dtype per-param, decoupled from storage

fully_shard_by_dtype now groups parameters by their required *compute* dtype instead of their storage dtype, so fp32 master weights (uniform fp32 storage) still compute the bulk in mp_policy.param_dtype (bf16) while intrinsically-fp32 params keep fp32 compute. Per-parameter compute dtype is resolved by precedence: pinned fp32 (_keep_in_fp32_modules_strict) > HF-recorded checkpoint dtype (tagged onto each tensor at load time in _restore_loaded_model_dtype) > mp_policy.param_dtype. Qwen3.5's GatedDeltaNet fp32 holder is declared via patch_hf_model; the NemotronH and Qwen3.5 strategies thread the declaration through.

(cherry picked from commit 3dd6b97)

* docs(model-onboarding): document _keep_in_fp32_modules_strict contract

Add SKILL.md §2.6 explaining which params must compute in fp32 (SSM A_log/dt_bias/D, MoE sigmoid-gate bias, attention-sink bias, scale), how to declare them (class attribute vs patch_hf_model instance attribute), and why the pin is the robust signal across all load paths. Broaden the MoE checklist item and code comment accordingly.

(cherry picked from commit a11db38)

* test(distributed): add fp32 compute-dtype contract test

Assert the resident compute dtype of every trainable parameter across the model archetypes that use fully_shard_by_dtype (dense, Qwen3.5-style hybrid), covering the full precedence chain: pinned fp32 > HF-recorded dtype > mp_policy.param_dtype, under fp32 master weights and ordinary loads.

(cherry picked from commit dc83926)

* feat(model): cast frozen modules to compute dtype to avoid mismatch

(cherry picked from commit d321f5e)

* refactor(gemma4): drop projector dtype hook now general frozen cast handles it

(cherry picked from commit 1bc67e2)

* feat(training): add dormant resolve_storage_dtype helper

Add resolve_storage_dtype() (and its unit tests) for defaulting model.torch_dtype to fp32 for full-parameter torch.optim training. Not yet wired into recipes here; the call sites are marked with breadcrumb comments and enabled in a follow-up PR, keeping this PR limited to dtype bug fixes with no behavior/memory change.

* fix(model): cast frozen-module buffers and unsharded params to compute dtype
* docs(infra): correct frozen-tower FSDP comment to match sharding reality
* docs(mixed-precision): clarify TE vs torch AdamW memory and precision trade-offs
* docs(mixed-precision): apply tech writer edits
* docs(mixed-precision): drop unresolvable FSDP anchor

---------

* docs(speculative): add subsystem README, fold in regeneration guide (#2448)

Add examples/speculative/README.md covering the whole speculative-decoding draft-training subsystem: supported methods (EAGLE-1/2/3/3.1, P-EAGLE, DFlash), target-model registry coverage, compute backends (eager vs flash_attention_2, flex_attention/sdpa, fused Triton soft cross-entropy, d2t/t2d draft-vocab compression), target backends (co-located, remote, offline cache), serving and benchmarking, inference-engine compatibility, and a consolidated config reference. Fold the standalone regenerate_with_target.md into the README's data preparation section (full two-step flow, tuning table, pitfalls) and remove the separate file so there is a single entry point.
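
The compute-dtype precedence stated in the dtype-contract fix (#2419) above (pinned fp32 > HF-recorded checkpoint dtype > mp_policy.param_dtype) can be sketched in plain Python. This is an illustration only: `resolve_compute_dtype` and `group_by_compute_dtype` are hypothetical names, dtypes are represented as strings rather than torch dtypes, and matching pinned modules by dotted name component is an assumption, not something the commit message specifies.

```python
from collections import defaultdict


def resolve_compute_dtype(param_name, pinned_fp32, recorded_dtypes, mp_param_dtype):
    """Pick a compute dtype for one parameter by the stated precedence.

    pinned_fp32:     names pinned to fp32 (stand-in for _keep_in_fp32_modules_strict)
    recorded_dtypes: param name -> dtype recorded from the checkpoint at load time
    mp_param_dtype:  fallback (stand-in for mp_policy.param_dtype)
    """
    # 1. A pinned fp32 declaration wins over everything else.
    #    Assumption: a pin matches when it equals one dotted component of the name.
    if any(pin in param_name.split(".") for pin in pinned_fp32):
        return "float32"
    # 2. Otherwise use the dtype recorded from the checkpoint, if any.
    recorded = recorded_dtypes.get(param_name)
    if recorded is not None:
        return recorded
    # 3. Otherwise fall back to the mixed-precision policy's param dtype.
    return mp_param_dtype


def group_by_compute_dtype(param_names, pinned_fp32, recorded_dtypes, mp_param_dtype):
    """Group parameter names by resolved compute dtype, not by storage dtype."""
    groups = defaultdict(list)
    for name in param_names:
        dtype = resolve_compute_dtype(name, pinned_fp32, recorded_dtypes, mp_param_dtype)
        groups[dtype].append(name)
    return dict(groups)


groups = group_by_compute_dtype(
    ["layers.0.mlp.weight", "layers.0.ssm.A_log", "layers.0.router.bias"],
    pinned_fp32=["A_log"],
    recorded_dtypes={"layers.0.mlp.weight": "bfloat16", "layers.0.ssm.A_log": "bfloat16"},
    mp_param_dtype="bfloat16",
)
```

In this toy input the pinned `A_log` lands in the fp32 group even though its recorded dtype is bf16, while the unrecorded router bias falls through to the policy default.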
* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support (#2284)
* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support
* fix the memory management for training large 14B wan model
* fix wan2.2 support
* all good for wan2.2
* update
* docs(fern): add Wan2.2 T2V-A14B model coverage and release log entry
* fix another round of code review
* fix(diffusion): sort wan.py imports to satisfy CI isort (I001)
* fix(diffusion): load inference checkpoints to CPU to halve peak GPU memory

Avoids doubling peak GPU memory (and a potential OOM in Wan2.2 two-stage inference) by loading EMA/consolidated state dicts with map_location="cpu"; load_state_dict copies into the already-on-device parameters.

---------

* test: include find_unused_parameters in ddp manager args expectation

The DDP strategy config exposes find_unused_parameters (default False), so _build_diffusion_parallel_manager_args returns it in the ddp branch. Update the test's expected dict to match, fixing the L0 unit test failure.

* fix(distributed): address Claude review comments

- infrastructure.py: forward the model wrapper's mp_policy (from FSDP2Config) to the MoE expert parallelizer when MoEParallelizerConfig.mp_policy is unset, so a custom precision policy isn't silently dropped for EP models.
- skills/nemo-automodel-distributed-training/SKILL.md: fix stale references: MeshContext no longer holds strategy_config/pipeline_config/moe_config and STRATEGY_MAP moved to _STRATEGY_MAP in config.py; MoEParallelizerConfig now lives in components/distributed/config.py.

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Signed-off-by: Adil Asif <adasif@nvidia.com>
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: thyways <2484113689@qq.com>
Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: linnan wang <linnanw@nvidia.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Yuhe Zhang <yuhezhang.zju@gmail.com>
Co-authored-by: khazzz1c <khazzz1c@gmail.com>
Co-authored-by: thyways <2484113689@qq.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Pranav Thombre <pthombre@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Co-authored-by: linnan wang <linnanw@nvidia.com>
feat(vlm): enable Qwen3.5 MoE VLM CP (#2432)

* feat(vlm): enable Qwen3.5 MoE VLM CP
* test(vlm): cover Qwen3.5 MoE VLM CP changes

Add unit coverage for the new/changed code paths in PR #2432:

- cp_utils: opt-in seq_index CP buffer, singleton expansion, arange-continued padding
- Qwen3_5MoeBlock.forward: seq_index threading into linear_attn, stripping on full-attn path
- prepare_model_inputs_for_cp / _pre_embed_only dispatch and text-only forward path
- PreTokenizedDatasetWrapper inject_fake_images gating + build_dataloader passthrough
- _run_validation_epoch: total_tokens not summed over CP ranks

* style(vlm): sort imports in qwen3_5_moe model.py

Fixes ruff I001 (unsorted import block) flagged by CI linting: `import inspect` was added above `import copy`.

* refactor(qwen): keep CP seq index out of cp utils
* rename qwen medpix cp2 config
* test(qwen): align CP seq-index tests with cp_linear_attn refactor

The "keep CP seq index out of cp utils" refactor moved seq_index handling out of make_cp_batch_and_ctx and prepare_model_inputs_for_cp into CPAwareGatedDeltaNet. Update tests accordingly:

- drop obsolete seq_index buffer/padding tests from test_cp_utils
- prepare_model_inputs_for_cp now returns only inputs_embeds + position_ids
- rewrite TestExtractLocalPositions -> TestExtractLocalSeqIndex for the new _extract_local_seq_index signature
- add coverage for _build_dual_chunk_local_positions (DualChunkSwap layout)

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
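
The commit above mentions a "DualChunkSwap layout" without describing it. The sketch below assumes it refers to the load-balanced split often used for causal context parallelism, where the sequence is cut into 2 * cp_size equal chunks and rank r holds chunk r plus chunk 2 * cp_size - 1 - r. That reading is an assumption, and `dual_chunk_positions` is a hypothetical helper, not the PR's `_build_dual_chunk_local_positions`.

```python
def dual_chunk_positions(seq_len, cp_size, rank):
    """Global token positions held by one CP rank under the assumed layout.

    The sequence is split into 2 * cp_size equal chunks. Rank r keeps one chunk
    from the front (index r) and the mirrored chunk from the back
    (index 2 * cp_size - 1 - r), which balances causal-attention work.
    """
    num_chunks = 2 * cp_size
    if seq_len % num_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * cp_size")
    if not 0 <= rank < cp_size:
        raise ValueError("rank must be in [0, cp_size)")
    chunk = seq_len // num_chunks
    front, back = rank, num_chunks - 1 - rank
    return (
        list(range(front * chunk, (front + 1) * chunk))
        + list(range(back * chunk, (back + 1) * chunk))
    )


rank0 = dual_chunk_positions(seq_len=8, cp_size=2, rank=0)
rank1 = dual_chunk_positions(seq_len=8, cp_size=2, rank=1)
```

Because the local positions are not contiguous, any per-token index a layer depends on has to be gathered with the same permutation, which is presumably why the tests cover a local-positions builder for this layout.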
feat(model): flux2 (#2145)

* flux2 init draft
* update
* fix(diffusion): revert flux1 example and add flux2 inference config

- Revert accidental FLUX.2 changes in flux_t2i_flow.yaml back to FLUX.1-dev
- Add examples/diffusion/generate/configs/generate_flux2.yaml for FLUX.2-dev inference

* fix(diffusion): fix flux2 contiguity and text encoder eval

- Add .contiguous() after permute in _pack_latents and _unpack_latents so hidden_states is always contiguous before flash-attention kernel
- Call pipeline.text_encoder.eval() after device placement, consistent with FluxProcessor, WanProcessor, and QwenImageProcessor

* feat(diffusion): sync flux2 configs with main performance fields

Add optimizer flags (foreach/fused), performance block, FSDP2 prefetch tuning, save_checkpoint_every_epoch, and save_consolidated=final to flux2_t2i_flow.yaml and flux2_t2i_flow_lora.yaml to match the fields added to flux_t2i_flow.yaml in main.

* fix(diffusion): fix flux2 cfg dropout to apply per-sample not per-batch

Replace single random.random() gate (correlated across entire batch) with a per-sample Bernoulli mask so each sample independently has cfg_dropout_prob chance of receiving zeroed text embeddings. Also drop the now-unused `import random`.

* fix(diffusion): fix flux cfg dropout to apply per-sample not per-batch

Replace single random.random() gate (correlated across entire batch) with a per-sample Bernoulli mask so each sample independently has cfg_dropout_prob chance of receiving zeroed text/pooled embeddings. Also drop the now-unused `import random`.

* test(diffusion): add unit tests for Flux2Adapter and Flux2Processor

- tests/unit_tests/flow_matching/test_flux2_adapter.py: 36 tests covering pack/unpack roundtrip + contiguity, 4D positional IDs (img_ids/txt_ids) shape/dtype/value correctness, prepare_inputs keys/shapes/normalization/CFG dropout, and forward model call kwargs
- tests/unit_tests/diffusion_processors/test_flux2_processor.py: 22 tests covering model_type/default_model_name properties, encode_image BN normalization + dtype + squeeze, encode_text Mistral3 args + no-clip keys, verify_latent shape/NaN/Inf checks, get_cache_data structure, and ProcessorRegistry lookup

---------

Signed-off-by: linnan wang <linnanw@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: linnan wang <linnanw@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
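
The per-sample versus per-batch CFG dropout distinction in the flux/flux2 fixes above can be sketched as follows. This is a plain-Python stand-in, not the adapter's code: the real fix presumably builds the mask as a tensor, and `cfg_dropout_mask` and `apply_cfg_dropout` are hypothetical names. The standard-library `random.Random` is used here only to keep the sketch dependency-free.

```python
import random


def cfg_dropout_mask(batch_size, cfg_dropout_prob, rng):
    """One independent Bernoulli draw per sample.

    True means "zero this sample's text embeddings". A per-batch gate would
    instead make a single draw and apply it to every sample at once.
    """
    return [rng.random() < cfg_dropout_prob for _ in range(batch_size)]


def apply_cfg_dropout(text_embeds, drop_mask):
    """Return a copy of the batch with dropped samples replaced by zeros."""
    return [
        [0.0] * len(embed) if drop else list(embed)
        for embed, drop in zip(text_embeds, drop_mask)
    ]


rng = random.Random(0)  # seeded only so the sketch is repeatable
batch = [[0.5, -1.0], [2.0, 3.0], [1.5, 0.25]]
never = cfg_dropout_mask(len(batch), 0.0, rng)   # probability 0: nothing dropped
always = cfg_dropout_mask(len(batch), 1.0, rng)  # probability 1: everything dropped
mixed = apply_cfg_dropout(batch, [True, False, True])
```

With independent draws, a batch can contain both conditioned and unconditioned samples in the same step, which a single shared gate can never produce.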
Signed-off-by: gitlab-runner <gitlab-runner@nvidia.com>
beep boop 🤖: Updating transformers to the latest version on PyPI