Image Quality Evaluation Metrics
DiffSynth-Studio provides a suite of image quality evaluation metrics and reward models in diffsynth.metrics to assess text alignment, aesthetic quality, human preference, and image distribution quality of generated images. Example code for these metrics can be found in examples/image_quality_metric/.
Installation
Before using this project for model inference and training, please install DiffSynth-Studio first.
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
For more information about installation, please refer to Install Dependencies.
Quick Start
Run the following code to quickly load PickScore and score an image against a prompt. The default models will be downloaded from ModelScope to ./models.
from diffsynth.metrics import PickScoreMetric, ModelConfig
from modelscope import dataset_snapshot_download
from PIL import Image

# Download the example images used below.
dataset_snapshot_download(
    "DiffSynth-Studio/diffsynth_example_dataset",
    allow_file_pattern="flux/FLUX.1-dev/*",
    local_dir="./data/diffsynth_example_dataset",
)
image = Image.open("data/diffsynth_example_dataset/flux/FLUX.1-dev/1.jpg").convert("RGB")
prompt = "a dog"

# Load PickScore; the weights are fetched from ModelScope on first use.
metric = PickScoreMetric.from_pretrained(
    model_config=ModelConfig(model_id="DiffSynth-Studio/ImageMetrics", origin_file_pattern="PickScore/model.safetensors"),
    device="cuda",
)

# compute() returns one score per image; take the first entry.
score = metric.compute(prompt, image)[0]
print(f"PickScore: {score:.3f}")
Metrics Overview
| Metric | Input | Output | Example Code |
|---|---|---|---|
| PickScore | prompt + PIL Image | Preference Score | code |
| ImageReward | prompt + PIL Image | Preference Score | code |
| HPSv2 | prompt + PIL Image | Preference Score | code |
| HPSv3 | prompt + PIL Image | Preference Score | code |
| CLIP Score | prompt + PIL Image | Text-Image Similarity | code |
| UnifiedReward 2.0 | prompt + PIL Image | Multi-Dimensional Scores | code |
| Qwen-Image-Bench | prompt + PIL Image | Overall Score and Multi-Level Dimension Scores | code |
| UnifiedReward Edit | editing instruction + source image + edited image | Image Editing Quality Score | code |
| Aesthetic | PIL Image | Aesthetic Score | code |
| FID | reference image directory + generated image directory | Distribution Distance | code |
Text-Image Alignment and Preference Evaluation
Applicable metrics: PickScore, ImageReward, HPSv2, HPSv3, CLIP Score, UnifiedReward 2.0, Qwen-Image-Bench
These models evaluate whether an image follows the prompt and matches human visual preferences. Every call requires both a prompt and an image.
Basic Scoring
score = metric.compute(prompt, image)[0]
Batch Scoring
If you need to evaluate multiple images, you can directly pass a list:
scores = metric.compute("a cute cat", [image1, image2, image3])
scores = metric.compute(["a cat", "a dog"], [image_cat, image_dog])
When prompt is a single string, the same prompt will be applied to every image. When prompt is a list of strings, the number of prompts must exactly match the number of images.
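The pairing rule above can be sketched in plain Python. The helper pair_prompts_with_images below is hypothetical and is not part of diffsynth.metrics; it only illustrates how a single prompt would be reused for every image and how a length mismatch would be rejected. The library's own error handling may differ.

```python
from typing import List, Sequence, Tuple, Union


def pair_prompts_with_images(
    prompt: Union[str, Sequence[str]],
    images: Sequence[object],
) -> List[Tuple[str, object]]:
    """Hypothetical helper: pair each image with the prompt it is scored against."""
    if isinstance(prompt, str):
        # A single string is reused for every image.
        prompts = [prompt] * len(images)
    else:
        prompts = list(prompt)
        if len(prompts) != len(images):
            raise ValueError(
                f"got {len(prompts)} prompts for {len(images)} images; counts must match"
            )
    return list(zip(prompts, images))


# Placeholder strings stand in for PIL images here.
pairs = pair_prompts_with_images("a cute cat", ["img1", "img2", "img3"])
print(pairs)
```

Running a check like this before a large batch can surface a mismatched prompt list early, before any model is loaded.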
Multi-Dimensional Image Quality Evaluation
Applicable metrics: UnifiedReward 2.0, Qwen-Image-Bench
These metrics also receive a prompt and an image, but in addition to the primary score, evaluate() returns more detailed evaluation dimensions. They are useful when you need to analyze text-image alignment, visual coherence, style, or multi-level quality dimensions.
Qwen-Image-Bench
from diffsynth.metrics import ModelConfig, QwenImageBenchMetric
metric = QwenImageBenchMetric.from_pretrained(
    model_config=ModelConfig(
        model_id="Qwen/Qwen-Image-Bench",
        origin_file_pattern="model-*.safetensors",
    ),
    processor_config=ModelConfig(
        model_id="Qwen/Qwen-Image-Bench",
        origin_file_pattern="",
    ),
    device="cuda",
)
details = metric.evaluate(prompt, image)[0]
score = details["total_score"]
print(details["level1_scores"])
print(details["level2_scores"])
If you only need the primary score, you can also call metric.compute(prompt, image).
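Once the detailed scores are available, a common next step is to find the weakest dimensions. The sketch below assumes, without confirmation from the library, that a score group such as details["level1_scores"] is a flat mapping from dimension name to a numeric score where higher is better. The helper weakest_dimensions and the example values are hypothetical; adapt them to the structure that evaluate() actually returns.

```python
from typing import Dict, List, Tuple


def weakest_dimensions(scores: Dict[str, float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k lowest-scoring dimensions, lowest first."""
    return sorted(scores.items(), key=lambda item: item[1])[:k]


# Hand-written stand-in for details["level1_scores"]; names and values are made up.
example_level1 = {"alignment": 0.82, "coherence": 0.64, "style": 0.91}
print(weakest_dimensions(example_level1))
```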
Image Editing Quality Evaluation
Applicable metric: UnifiedReward Edit
UnifiedReward Edit evaluates whether an edited image follows the editing instruction and whether it is over-edited. The input usually includes an editing instruction, a source image, and edited image candidates. It supports three tasks:
edit_pointwise_score: scores a single edited result, given [source_image, edited_image].
edit_pairwise_rank: compares two edited results and returns the winner, given [source_image, edited_image_1, edited_image_2].
edit_pairwise_score: returns separate scores for two edited results, given [source_image, edited_image_1, edited_image_2].
from diffsynth.metrics import ModelConfig, UnifiedRewardEditMetric
metric = UnifiedRewardEditMetric.from_pretrained(
    model_config=ModelConfig(
        model_id="DiffSynth-Studio/ImageMetrics",
        origin_file_pattern="UnifiedReward-Edit-qwen3vl-8b/model-*.safetensors",
    ),
    processor_config=ModelConfig(
        model_id="DiffSynth-Studio/ImageMetrics",
        origin_file_pattern="UnifiedReward-Edit-qwen3vl-8b/",
    ),
    device="cuda",
)
details = metric.evaluate(
    instruction,
    [source_image, edited_image],
    task="edit_pointwise_score",
)[0]
print(details["score"], details["editing_success"], details["overediting"])
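Each task expects a specific number of images: one source image plus one or two edited candidates. A small pre-flight check can catch a wrong list length before the model runs. The helper check_edit_inputs below is a hypothetical sketch and is not part of diffsynth.metrics; the counts are taken from the task descriptions above, and the library may validate inputs differently.

```python
from typing import Sequence

# Number of images each task expects, per the task descriptions:
# one source image plus one or two edited candidates.
EXPECTED_IMAGE_COUNT = {
    "edit_pointwise_score": 2,
    "edit_pairwise_rank": 3,
    "edit_pairwise_score": 3,
}


def check_edit_inputs(task: str, images: Sequence[object]) -> None:
    """Hypothetical pre-flight check; raises ValueError on a bad task or image count."""
    if task not in EXPECTED_IMAGE_COUNT:
        raise ValueError(f"unknown task: {task!r}")
    expected = EXPECTED_IMAGE_COUNT[task]
    if len(images) != expected:
        raise ValueError(f"{task} expects {expected} images, got {len(images)}")


# Placeholder strings stand in for PIL images; a valid call passes silently.
check_edit_inputs("edit_pairwise_rank", ["source", "edit_1", "edit_2"])
```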
Pure Image Aesthetics Evaluation
Applicable metric: Aesthetic
This model solely evaluates aesthetic features such as the composition, color, and clarity of the image itself. It does not require a prompt.
from diffsynth.metrics import AestheticMetric
metric = AestheticMetric.from_pretrained(device="cuda")
score = metric.compute(image)[0]
Dataset Distribution Evaluation
Applicable metric: FID (Fréchet Inception Distance)
FID does not score individual images; instead, it compares the overall feature distribution distance between a real reference image set and a generated image set. A lower score indicates that the generated distribution is closer to the real distribution.
from diffsynth.metrics import FIDMetric
reference_dir = "path/to/real_reference_images"
generated_dir = "path/to/model_generated_images"
metric = FIDMetric.from_pretrained(device="cuda", batch_size=16)
fid_score = metric.compute(reference_dir, generated_dir)
print(f"FID: {fid_score:.3f}")
There is no single standard reference set for FID. For general image generation, the COCO validation set is commonly used; for a specific domain (such as medical images or e-commerce products), reference_dir should contain real data from that domain.
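To make the distribution distance concrete, here is a minimal NumPy sketch of the Fréchet distance between two Gaussians fitted to feature vectors. It is not the FIDMetric implementation: random vectors stand in for Inception features, and the function names frechet_distance and feature_statistics are illustrative only.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1, sigma2 = np.asarray(sigma1, dtype=float), np.asarray(sigma2, dtype=float)
    mean_term = float(np.sum((mu1 - mu2) ** 2))
    # Tr((sigma1 @ sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # positive semi-definite inputs; clip small numerical noise.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    sqrt_trace = float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))
    return mean_term + float(np.trace(sigma1) + np.trace(sigma2)) - 2.0 * sqrt_trace


def feature_statistics(features):
    """Mean vector and covariance matrix of an (n_samples, n_dims) array."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), np.cov(features, rowvar=False)


# Random vectors stand in for Inception features of real and generated images.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))
generated = rng.normal(0.5, 1.0, size=(2000, 8))
print(frechet_distance(*feature_statistics(real), *feature_statistics(generated)))
```

The distance is zero when both sets have identical mean and covariance, and it grows as either statistic drifts apart, which is why the choice of reference set matters.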
Important Notes
The scores from PickScore, ImageReward, HPSv2, HPSv3, CLIP Score, UnifiedReward 2.0, Qwen-Image-Bench, UnifiedReward Edit, and Aesthetic are suitable for relative comparison within the same metric. Comparing numerical values directly across different metrics is not recommended.
HPSv3, UnifiedReward 2.0, UnifiedReward Edit, and Qwen-Image-Bench are built on large multimodal models and require significantly more VRAM than CLIP-based metrics.
FID is sensitive to the choice of reference set and to the number of reference and generated samples, so keep these fixed when comparing models.
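As an illustration of the first note, the sketch below compares candidates by their ordering within each metric instead of by raw values. The scores are made up, the helper ranking is hypothetical, and it assumes that a higher score is better for both metrics.

```python
from typing import Dict, List


def ranking(scores: Dict[str, float]) -> List[str]:
    """Order candidate names from best to worst, assuming higher scores are better."""
    return sorted(scores, key=scores.get, reverse=True)


# Made-up scores for three candidate images under two metrics with different scales.
metric_a = {"img1": 21.4, "img2": 19.8, "img3": 22.1}
metric_b = {"img1": 0.31, "img2": 0.27, "img3": 0.29}

# Raw values such as 21.4 and 0.31 are on unrelated scales and say nothing
# about each other; the per-metric orderings are what can be set side by side.
print(ranking(metric_a))
print(ranking(metric_b))
```

Here the two metrics agree that img2 is the weakest candidate but disagree on the best one, which is a conclusion the raw numbers alone would not support.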