These notes describe the ComfyUI LoRA trainer added for HiDream O1 and the defaults chosen to match AI Toolkit's working O1 implementation.
HiDream O1 is a Qwen3-VL style model used as a pixel-level diffusion transformer. Instead of using a VAE latent UNet, the trainer patchifies RGB images directly, adds flow-matching noise, feeds the noisy image patches through the model's vision path, and trains LoRA layers on the model prediction.
The model returns an x0 prediction. AI Toolkit converts that x0 prediction into a flow velocity before applying the loss. This trainer now follows the same idea:
scaled_noise = noise * noise_scale
noisy_image = (1 - sigma) * image + sigma * scaled_noise
velocity_pred = (noisy_image - x0_pred) / sigma
velocity_target = scaled_noise - image
loss = mse(velocity_pred, velocity_target)
Default noise_scale is 8.0, matching AI Toolkit's HiDream O1 model kwargs.
Use a Full model for LoRA training:
HiDream-O1-Image-BF16is the default Comfy training choice.HiDream-O1-Image-FP8can be used when VRAM is tight, with BF16 compute where supported.
Dev and Dev-2604 are intentionally not exposed in the training node. They are distilled inference variants and can train unpredictably. Upstream also recommends the Full model for editing tasks.
Create a folder with image files and matching .txt captions:
my_dataset/
img_0001.png
img_0001.txt
img_0002.jpg
img_0002.txt
Run:
HiDream O1 Dataset Maker -> HiDream O1 Train Config -> HiDream O1 LoRA Trainer
The Dataset Maker writes a JSONL manifest that the trainer consumes.
For a person LoRA, keep captions consistent and direct:
photo of m0n0y0 person, close-up portrait, curly black hair, city street background
photo of m0n0y0 person, upper body portrait, neutral expression, natural light
photo of m0n0y0 person, candid portrait, black jacket, outdoor background
Use a unique trigger token that does not look like a common word. Put it early in every caption. Avoid captions that over-describe saturation, high contrast, harsh lighting, heavy cinematic grading, or stylized color if you want identity rather than baked-in style.
The current defaults are intended to match AI Toolkit's O1 path:
| Setting | Default | Why |
|---|---|---|
base_model_name |
HiDream-O1-Image-BF16 |
Full model training, local converted BF16 default |
learning_rate |
0.0001 |
AI Toolkit job default |
lora_rank |
32 |
AI Toolkit linear LoRA default |
lora_alpha |
32 |
AI Toolkit linear alpha default |
target_preset |
aitoolkit |
Train linear-like layers except lm_head, patch_embed, visual |
loss_target |
velocity |
Match AI Toolkit x0-to-flow-velocity loss |
noise_scale |
8.0 |
Match AI Toolkit HiDream O1 noise scale |
timestep_type |
linear |
AI Toolkit O1 override |
min_sigma / max_sigma |
0.001 / 0.999 |
Broad flow-matching timestep coverage |
max_loss |
1.0 |
Cap loss spikes like AI Toolkit |
weight_decay |
0.0001 |
AI Toolkit AdamW default |
warmup_steps |
0 |
AI Toolkit default scheduler behavior |
resolution |
1024 |
Practical direct-pixel training size |
caption_dropout |
0.05 |
AI Toolkit dataset default |
save_every_steps |
250 |
Frequent checkpoint comparison |
Do not judge from one final checkpoint only. Save at least every 250 steps and sample several strengths:
step 250: strengths 0.5, 0.7, 1.0
step 500: strengths 0.5, 0.7, 1.0
step 750: strengths 0.5, 0.7, 1.0
step 1000: strengths 0.5, 0.7, 1.0
For small person datasets, useful identity often appears between 500 and 1500 steps. If the image gets oversaturated, overly contrasty, or starts copying dataset lighting too strongly, stop earlier or lower inference strength.
If the LoRA does not resemble the person:
- Make sure every caption contains the same trigger token.
- Use clearer face crops and varied angles.
- Keep the training target at
aitoolkit;attentiononly can be too weak for identity. - Try more steps before changing many settings.
- Sample with simpler prompts first, such as
photo of m0n0y0 person, portrait.
If the LoRA is overbaked:
- Test earlier checkpoints.
- Lower LoRA strength at inference to
0.5or0.7. - Remove strong color/lighting words from captions.
- Lower learning rate to
5e-5. - Avoid training on many near-duplicate images.
If colors blow out immediately:
- Confirm
loss_target=velocity. - Confirm
noise_scale=8.0. - Confirm you restarted ComfyUI after updating the node.
- Recreate the Train Config node so old workflow defaults are not stuck in the graph.
- Batch size is currently effectively one image per micro-step.
- No resume training node yet.
- No reference-image/edit/IP LoRA training path yet.
- No validation sampler is run during training; compare saved checkpoints manually.
- The trainer runs in-process and blocks the ComfyUI queue.
Dev-2604 is available in the model loader for inference:
HiDream-O1-Image-Dev-2604-BF16HiDream-O1-Image-Dev-2604-FP16HiDream-O1-Image-Dev-2604-FP8
For Dev edit mode with exactly one reference image, the sampler defaults to flow_match, matching the May 13, 2026 upstream scheduler update. Text/subject Dev paths still use the fixed 28-step flash schedule by default.