# Image Quality Evaluation Metrics

DiffSynth-Studio provides a suite of image quality evaluation metrics and reward models in `diffsynth.metrics`. They assess the text alignment, aesthetic quality, human preference, and distribution quality of generated images.

Example code for these metrics can be found in [`examples/image_quality_metric/`](../../../examples/image_quality_metric/).

## Installation

Before using this project for model inference and training, install DiffSynth-Studio:

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

For more information about installation, refer to [Install Dependencies](../Pipeline_Usage/Setup.md).

## Quick Start

Run the following code to load PickScore and score an image against a prompt. The default models are downloaded from ModelScope to `./models`.

```python
from diffsynth.metrics import PickScoreMetric, ModelConfig
from modelscope import dataset_snapshot_download
from PIL import Image

# Download the example images used below.
dataset_snapshot_download(
    "DiffSynth-Studio/diffsynth_example_dataset",
    allow_file_pattern="flux/FLUX.1-dev/*",
    local_dir="./data/diffsynth_example_dataset",
)
image = Image.open("data/diffsynth_example_dataset/flux/FLUX.1-dev/1.jpg").convert("RGB")
prompt = "a dog"

# Load the PickScore weights and score the image against the prompt.
metric = PickScoreMetric.from_pretrained(
    model_config=ModelConfig(model_id="DiffSynth-Studio/ImageMetrics", origin_file_pattern="PickScore/model.safetensors"),
    device="cuda"
)
score = metric.compute(prompt, image)[0]
print(f"PickScore: {score:.3f}")
```

## Metrics Overview

| Metric | Input | Output | Example Code |
| --- | --- | --- | --- |
| PickScore | prompt + PIL Image | Preference Score | [code](../../../examples/image_quality_metric/pickscore.py) |
| ImageReward | prompt + PIL Image | Preference Score | [code](../../../examples/image_quality_metric/image_reward.py) |
| HPSv2 | prompt + PIL Image | Preference Score | [code](../../../examples/image_quality_metric/hpsv2.py) |
| HPSv3 | prompt + PIL Image | Preference Score | [code](../../../examples/image_quality_metric/hpsv3.py) |
| CLIP Score | prompt + PIL Image | Text-Image Similarity | [code](../../../examples/image_quality_metric/clipscore.py) |
| UnifiedReward 2.0 | prompt + PIL Image | Multi-Dimensional Scores | [code](../../../examples/image_quality_metric/unified_reward_2.py) |
| Qwen-Image-Bench | prompt + PIL Image | Overall score and multi-level dimension scores | [code](../../../examples/image_quality_metric/qwen_image_bench.py) |
| UnifiedReward Edit | editing instruction + source image + edited image | Image Editing Quality Score | [code](../../../examples/image_quality_metric/unified_reward_edit.py) |
| Aesthetic | PIL Image | Aesthetic Score | [code](../../../examples/image_quality_metric/aesthetic.py) |
| FID | reference image directory + generated image directory | Distribution Distance | [code](../../../examples/image_quality_metric/fid.py) |

### Text-Image Alignment and Preference Evaluation

Applicable metrics: **PickScore**, **ImageReward**, **HPSv2**, **HPSv3**, **CLIP Score**, **UnifiedReward 2.0**, **Qwen-Image-Bench**

These models evaluate whether an image follows the prompt and aligns with human visual preferences. They require both a `prompt` and an `image`.

**Basic Scoring**

```python
score = metric.compute(prompt, image)[0]
```

**Batch Scoring**

To evaluate multiple images, pass a list:

```python
# One prompt applied to every image.
scores = metric.compute("a cute cat", [image1, image2, image3])
# One prompt per image.
scores = metric.compute(["a cat", "a dog"], [image_cat, image_dog])
```

When `prompt` is a single string, the same prompt is applied to every image. When `prompt` is a list of strings, the number of prompts must exactly match the number of images.
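The pairing rule above can be illustrated with a small, library-independent sketch. `pair_prompts_with_images` is a hypothetical helper written for this illustration; it is not part of `diffsynth.metrics` and makes no claim about how `compute()` is implemented internally. Plain strings stand in for PIL images.

```python
def pair_prompts_with_images(prompt, images):
    """Return (prompt, image) pairs following the rule described above.

    Hypothetical helper for illustration only; not part of diffsynth.metrics.
    """
    # Accept a single image as well as a list of images.
    if not isinstance(images, (list, tuple)):
        images = [images]
    # A single string is reused for every image.
    if isinstance(prompt, str):
        prompts = [prompt] * len(images)
    else:
        prompts = list(prompt)
        if len(prompts) != len(images):
            raise ValueError(
                f"Got {len(prompts)} prompts for {len(images)} images; the counts must match."
            )
    return list(zip(prompts, images))


# Strings stand in for PIL images here.
print(pair_prompts_with_images("a cute cat", ["img1", "img2", "img3"]))
print(pair_prompts_with_images(["a cat", "a dog"], ["img_cat", "img_dog"]))
```

Checking the counts before scoring, as in this sketch, surfaces a mismatched batch early instead of after the model has been loaded.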
### Multi-Dimensional Image Quality Evaluation

Applicable metrics: **UnifiedReward 2.0**, **Qwen-Image-Bench**

These metrics also take a `prompt` and an `image`. In addition to the primary score, `evaluate()` returns more detailed evaluation dimensions. They are useful when you need to analyze text-image alignment, visual coherence, style, or multi-level quality dimensions.

**Qwen-Image-Bench**

```python
from diffsynth.metrics import ModelConfig, QwenImageBenchMetric

metric = QwenImageBenchMetric.from_pretrained(
    model_config=ModelConfig(
        model_id="Qwen/Qwen-Image-Bench",
        origin_file_pattern="model-*.safetensors",
    ),
    processor_config=ModelConfig(
        model_id="Qwen/Qwen-Image-Bench",
        origin_file_pattern="",
    ),
    device="cuda",
)
# `prompt` and `image` are defined as in Quick Start.
details = metric.evaluate(prompt, image)[0]
score = details["total_score"]
print(details["level1_scores"])
print(details["level2_scores"])
```

If you only need the primary score, you can also call `metric.compute(prompt, image)`.

### Image Editing Quality Evaluation

Applicable metric: **UnifiedReward Edit**

UnifiedReward Edit evaluates whether an edited image follows the editing instruction and whether it is over-edited. The input usually includes an editing instruction, a source image, and one or more edited image candidates. It supports three tasks:

* `edit_pointwise_score`: scores a single edited result with `[source_image, edited_image]`.
* `edit_pairwise_rank`: compares two edited results and returns the winner with `[source_image, edited_image_1, edited_image_2]`.
* `edit_pairwise_score`: returns separate scores for two edited results with `[source_image, edited_image_1, edited_image_2]`.
```python
from diffsynth.metrics import ModelConfig, UnifiedRewardEditMetric

metric = UnifiedRewardEditMetric.from_pretrained(
    model_config=ModelConfig(
        model_id="DiffSynth-Studio/ImageMetrics",
        origin_file_pattern="UnifiedReward-Edit-qwen3vl-8b/model-*.safetensors",
    ),
    processor_config=ModelConfig(
        model_id="DiffSynth-Studio/ImageMetrics",
        origin_file_pattern="UnifiedReward-Edit-qwen3vl-8b/",
    ),
    device="cuda",
)
# `instruction` is the editing instruction string;
# `source_image` and `edited_image` are PIL images.
details = metric.evaluate(
    instruction,
    [source_image, edited_image],
    task="edit_pointwise_score",
)[0]
print(details["score"], details["editing_success"], details["overediting"])
```

### Pure Image Aesthetics Evaluation

Applicable metric: **Aesthetic**

This model evaluates only the aesthetic features of the image itself, such as composition, color, and clarity. It does not require a prompt.

```python
from diffsynth.metrics import AestheticMetric

metric = AestheticMetric.from_pretrained(device="cuda")
# `image` is a PIL image, loaded as in Quick Start.
score = metric.compute(image)[0]
```

### Dataset Distribution Evaluation

Applicable metric: **FID** (Fréchet Inception Distance)

FID does not score individual images. Instead, it measures the distance between the overall feature distributions of a real reference image set and a generated image set. A lower score indicates that the generated distribution is closer to the real distribution.

```python
from diffsynth.metrics import FIDMetric

reference_dir = "path/to/real_reference_images"
generated_dir = "path/to/model_generated_images"

metric = FIDMetric.from_pretrained(device="cuda", batch_size=16)
fid_score = metric.compute(reference_dir, generated_dir)
print(f"FID: {fid_score:.3f}")
```

There is no single fixed baseline for FID. For general image generation, the COCO validation set is commonly used as the reference. For specific domains (such as medical images or e-commerce products), provide a `reference_dir` composed of real data from that domain.
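To make "distribution distance" concrete: FID models each image set's feature vectors as a Gaussian and computes the Fréchet distance between the two Gaussians. The sketch below shows the one-dimensional special case, where the distance reduces to $(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2$. It is an illustration under simplifying assumptions, not the `FIDMetric` implementation: real FID works on high-dimensional Inception features with full covariance matrices, and the function name and toy numbers here are made up for this example.

```python
from statistics import fmean, pstdev


def frechet_distance_1d(reference, generated):
    """Fréchet distance between 1-D Gaussians fitted to two samples.

    Illustrative simplification; not the FIDMetric implementation.
    """
    mu_r, mu_g = fmean(reference), fmean(generated)
    sigma_r, sigma_g = pstdev(reference), pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


# Toy scalar "features" standing in for image features.
reference = [0.0, 1.0, 2.0, 3.0, 4.0]
close = [0.5, 1.5, 2.5, 3.5, 4.5]       # same spread, slightly shifted mean
far = [10.0, 10.5, 11.0, 11.5, 12.0]    # different mean and spread

print(frechet_distance_1d(reference, reference))  # identical sets give 0.0
print(frechet_distance_1d(reference, close))
print(frechet_distance_1d(reference, far))
```

Because the distance depends only on the fitted means and spreads, changing the reference set changes every score computed against it, which is why scores are comparable only when they share the same reference.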
## Important Notes

* Scores from PickScore, ImageReward, HPSv2, HPSv3, CLIP Score, UnifiedReward 2.0, Qwen-Image-Bench, UnifiedReward Edit, and Aesthetic are suitable for relative comparison within the same metric. Directly comparing numerical values across different metrics is not recommended.
* HPSv3, UnifiedReward 2.0, UnifiedReward Edit, and Qwen-Image-Bench are based on multimodal large models and require significantly more VRAM than CLIP-based metrics.
* FID is sensitive to the choice of reference set, the reference sample size, and the generated sample size.
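One way to follow the first note is to compare candidates by their rank within each metric rather than by raw values. The sketch below is a minimal illustration: `rank_within_metric` is a hypothetical helper, the metric names are placeholders, and the numbers are made up rather than real model outputs. It assumes higher scores are better, so it does not apply to FID.

```python
def rank_within_metric(scores):
    """Map each candidate name to its rank (1 = best) under a single metric.

    Hypothetical helper; assumes higher scores are better.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Made-up numbers on deliberately different scales.
scores_by_metric = {
    "metric_a": {"img_a": 21.3, "img_b": 22.1, "img_c": 20.4},
    "metric_b": {"img_a": 0.82, "img_b": 1.10, "img_c": -0.35},
}

for metric_name, scores in scores_by_metric.items():
    print(metric_name, rank_within_metric(scores))
```

Ranks computed this way can be compared across metrics even though the raw values live on unrelated scales.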