We are concerned about the authors' undocumented introduction of guidance_1 into the text-to-image model. Specifically, the authors' code below shows that, when skip_first_cfg is enabled, the first denoising step (i == 0) of the text-to-image model manually sets curr_guidance to 1.0, which appears intended to further increase sample diversity.
    guidance_1 = torch.full([1], 1.0, device=device, dtype=torch.float32).expand(1)
    if guidance is not None and skip_first_cfg and i == 0:
        curr_guidance = guidance_1
    else:
        curr_guidance = guidance
However, the base FLUX model keeps guidance (curr_guidance) at 3.5 throughout the entire sampling process, so the two methods are not compared under the same guidance settings, which we consider unfair. Moreover, the authors have neither stated nor explained this special configuration in the paper. We therefore question the authors' experimental results and look forward to their response.
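To make the concern concrete, the following is a minimal sketch of the per-step guidance values implied by the snippet above. The function name guidance_schedule and the parameter num_steps are hypothetical and introduced only for illustration, and we assume the authors' method also uses 3.5 for all remaining steps, which we could not confirm from the paper.

```python
def guidance_schedule(num_steps, guidance=3.5, skip_first_cfg=True):
    """Return the guidance value applied at each denoising step.

    Hypothetical helper mirroring the branch in the authors' snippet;
    it is not part of the authors' code base.
    """
    schedule = []
    for i in range(num_steps):
        if guidance is not None and skip_first_cfg and i == 0:
            # First step is overridden to 1.0, as guidance_1 does in the snippet.
            schedule.append(1.0)
        else:
            schedule.append(guidance)
    return schedule


# Assumed setting for the authors' method: override enabled.
modified = guidance_schedule(4, skip_first_cfg=True)
# Base FLUX as described above: constant guidance at every step.
baseline = guidance_schedule(4, skip_first_cfg=False)

print(modified)  # [1.0, 3.5, 3.5, 3.5]
print(baseline)  # [3.5, 3.5, 3.5, 3.5]
```

Under these assumptions the two schedules differ only at the first step, so an ablation that runs the base FLUX model with the same first-step override would separate the effect of this setting from the effect of the proposed method.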