Inference Optimization Techniques

DiffSynth-Studio aims to drive technological innovation through its foundational framework. This article demonstrates how to build a training-free image generation enhancement solution using DiffSynth-Studio, taking Inference-time scaling as an example.

Notebook: https://github.com/modelscope/DiffSynth-Studio/blob/main/docs/en/Research_Tutorial/inference_time_scaling.ipynb

1. Image Quality Quantification

First, we need to find an indicator to quantify image quality from generation models. Manual scoring is the most straightforward solution but too costly for large-scale applications. However, after collecting manual scores, training an image classification model to predict human scoring is completely feasible. PickScore [1] is such a model. Running the following code will automatically download and load the PickScore model.

from modelscope import AutoProcessor, AutoModel
import torch

class PickScore(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
        self.model = AutoModel.from_pretrained("AI-ModelScope/PickScore_v1").eval().to("cuda")

    def forward(self, image, prompt):
        image_inputs = self.processor(images=image, padding=True, truncation=True, max_length=77, return_tensors="pt").to("cuda")
        text_inputs = self.processor(text=prompt, padding=True, truncation=True, max_length=77, return_tensors="pt").to("cuda")
        with torch.inference_mode():
            image_embs = self.model.get_image_features(**image_inputs).pooler_output
            image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)
            text_embs = self.model.get_text_features(**text_inputs).pooler_output
            text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)
            score = (text_embs @ image_embs.T).flatten().item()
        return score

reward_model = PickScore()

2. Inference-time Scaling Techniques

Inference-time Scaling [2] is an interesting technique aiming to improve generation quality by increasing computational costs during inference. For example, in language models, models like Qwen/Qwen3.5-27B and deepseek-ai/DeepSeek-R1 use “thinking mode” to guide the model to spend more time considering results more carefully, producing more accurate answers. Next, we’ll use the black-forest-labs/FLUX.2-klein-4B model as an example to explore how to design Inference-time Scaling solutions for image generation models.

Before starting, we slightly modified the Flux2ImagePipeline code to allow initialization with specific Gaussian noise matrices for result reproducibility. See Flux2Unit_NoiseInitializer in diffsynth/pipelines/flux2_image.py.

Run the following code to load the black-forest-labs/FLUX.2-klein-4B model.

from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig

pipe = Flux2ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="transformer/*.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
)

Generate a sketch cat image using the prompt "sketch, a cat" and score it with the PickScore model.

def evaluate_noise(noise, pipe, reward_model, prompt):
    # Generate an image and compute the score.
    image = pipe(
        prompt=prompt,
        num_inference_steps=4,
        initial_noise=noise,
        progress_bar_cmd=lambda x: x,
    )
    score = reward_model(image, prompt)
    return score

torch.manual_seed(1)
prompt = "sketch, a cat"
noise = pipe.generate_noise((1, 128, 64, 64), rand_device="cuda", rand_torch_dtype=pipe.torch_dtype)

image_1 = pipe(prompt, num_inference_steps=4, initial_noise=noise)
print("Score:", reward_model(image_1, prompt))
image_1

Image