[Paper Review] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

by jii 2025. 7. 23. 00:00

[1801.03924] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

1. Motivation

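How to judge the similarity of two images

Existing approach: pixel-wise similarity
Assumes every pixel is independent
Problem: images are structurally correlated → blurring an image changes it noticeably to a human observer, yet the L2 distance barely changes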
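
To make the independence assumption concrete, here is a minimal sketch on made-up data (the toy 1-D "image", the noise level, and the function name l2_distance are all assumptions for illustration). Scrambling the pixel positions identically in both images destroys all structure, yet a pixel-wise L2 distance cannot tell the difference.

```python
import math
import random

def l2_distance(a, b):
    """Squared L2 distance between two images given as flat pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

random.seed(0)
# Hypothetical toy data: a smooth 1-D "image" and a noisy copy of it.
reference = [i / 63 for i in range(64)]
distorted = [p + random.gauss(0, 0.05) for p in reference]

# Apply the same random permutation of pixel positions to both images.
order = list(range(64))
random.shuffle(order)
reference_shuffled = [reference[i] for i in order]
distorted_shuffled = [distorted[i] for i in order]

d_original = l2_distance(reference, distorted)
d_shuffled = l2_distance(reference_shuffled, distorted_shuffled)

# The two distances agree up to floating-point summation order.
print(math.isclose(d_original, d_shuffled, rel_tol=1e-9))
```

Because each pixel contributes to the sum on its own, the metric has no notion of neighborhood or structure, which is the weakness the paper sets out to address.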

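Goal: a perceptual distance that agrees with human judgment

Previously proposed metrics: SSIM, MSSIM, FSIM, HDR-VDP

Why the task is hard

Human judgment depends on high-order image structure
It is context-dependent → the result depends on which attribute the observer weighs more heavily when comparing
It may not satisfy the mathematical properties of a distance

Solution: rather than learning perceptual similarity directly, obtain it indirectly from a network trained for something else

Internal representations of deep networks trained for other tasks, such as image classification, turn out to be useful
Perceptual loss: evaluates the similarity of two images by their distance in VGG feature space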

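Q. Do the internal representations of deep networks really reflect perceptual similarity?

Regardless of architecture, the internal representations of networks trained for classification agree with human judgment better than the existing metrics
Self-supervised methods (BiGAN, cross-channel prediction, puzzle solving) also work well
Unsupervised methods (stacked k-means) also work well
Randomly initialized networks do not: training is essential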

BAPPS dataset

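Traditional distortions
CNN outputs
Outputs of real algorithms are also included in the evaluation

→ Applying a simple linear scaling (weight adjustment) to the features of an already-trained network using this data improves agreement further

→ Perceptual similarity is a property that emerges from training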

Prior work on datasets

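Existing work

FR-IQA (full-reference image quality assessment)
NR-IQA (no-reference): assesses quality from a single image, without a reference

How this work differs

A wider variety of distortion types
Patches rather than whole images
Focuses on measuring perceptual similarity itself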

Prior work on deep networks and human judgments

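A growing number of studies use DNNs to assess similarity
How this work differs: multiple architectures, multiple training schemes, and a large-scale dataset
[6]: trains a network on perceptual similarity and evaluates how well it predicts the most/least noticeable distortions
This work: similar to [6], plus generalization to real algorithms and a JND test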

2. Berkeley-Adobe Perceptual Patch Similarity (BAPPS) Dataset

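2AFC (two-alternative forced choice): which of two distorted images looks more similar to the reference
JND (just noticeable difference): whether two images look the same or different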

2.1. Distortions

Traditional distortions

Each distortion has a parameter that controls its strength
Distortions are composed to increase the number of distortion types
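
A minimal sketch of the idea of parameterized, composable distortions. The two distortions chosen here (additive Gaussian noise and a brightness shift), the function names, and the flat pixel-list image format are assumptions for illustration; the review does not list the exact distortions or how they are chained.

```python
import random

def add_noise(pixels, strength, rng):
    """Additive Gaussian noise; strength is the standard deviation."""
    return [min(1.0, max(0.0, p + rng.gauss(0, strength))) for p in pixels]

def shift_brightness(pixels, strength, rng):
    """Constant brightness offset; rng is unused but keeps one signature."""
    return [min(1.0, max(0.0, p + strength)) for p in pixels]

def compose(*steps):
    """Chain (distortion, strength) pairs into a single new distortion."""
    def apply(pixels, rng):
        for distortion, strength in steps:
            pixels = distortion(pixels, strength, rng)
        return pixels
    return apply

# Hypothetical usage: noise followed by a brightness shift on a gray patch.
noisy_then_brighter = compose((add_noise, 0.05), (shift_brightness, 0.1))
patch = [0.5] * 8
print(noisy_then_brighter(patch, random.Random(0)))
```

Composing a handful of base distortions at different strengths multiplies the number of distinct distortion types without writing new ones.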

CNN-based distortions

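Autoencoding (compress, then reconstruct)
Denoising (noise removal)
Colorization (grayscale → color)
Superresolution (low resolution → high resolution)
Trained on ImageNet for only 1 epoch → deliberately under-trained so that the outputs contain errors
The aim is to produce the kinds of errors real deep learning models make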

Distorted image patches from real algorithms

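Evaluating on the outputs of real algorithms matters, because they are more realistic than purely synthetic distortions
Limitation: the output characteristics differ from algorithm to algorithm
e.g. colorization: color differences, color bleeding
e.g. superresolution: structural differences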

Superresolution, Frame interpolation, Video deblurring, Colorization

Task	Representative algorithms	Datasets	Evaluation setup
Superresolution	SRCNN, EDSR, VDSR, etc.	Set5, Set14, etc.	Patch triplets built from upsampled outputs
Frame interpolation	Flow, CNN, Phase	Davis, Middlebury	Patch triplets built at multiple scales
Video deblurring	Photoshop, Fourier, deep model	Benchmark of [53]	Patch triplets built from restored outputs
Colorization	pix2pix, Larsson, Zhang	ImageNet	Patch triplets built from colorized images

2.2. Psychophysical Similarity Measurements

2AFC similarity judgments

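(x, x₀, x₁, h) = one human judgment on a triplet, where x is the reference patch, x₀ and x₁ are the two distorted patches, and h is the human response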
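
One way to hold such a record in code is sketched below. The class name, the field names, and the choice to aggregate several annotators into h as the fraction who picked x₁ are assumptions for illustration, not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TwoAFCRecord:
    ref_path: str      # x  : reference patch
    p0_path: str       # x0 : first distorted patch
    p1_path: str       # x1 : second distorted patch
    votes: List[int]   # one entry per annotator: 0 -> x0 closer, 1 -> x1 closer

    @property
    def h(self) -> float:
        """Fraction of annotators who judged x1 closer to the reference."""
        return sum(self.votes) / len(self.votes)

# Hypothetical validation record with 5 annotators, 4 of whom chose x1.
record = TwoAFCRecord("ref.png", "p0.png", "p1.png", votes=[1, 1, 1, 1, 0])
print(record.h)
```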

	Existing (TID2013, etc.)	This paper's dataset
Number of images	Few	Many
Distortion types	Limited	Diverse (traditional + CNN-based + real algorithm outputs)
Unit	Whole image	64×64 patch
Focus	Quality	Similarity

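Why patches?

The space of whole images is too large
With whole images, semantics influence the judgment; patches let annotators focus on low-level information
Most deep image generation/restoration models are also trained on patches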

How the large-scale data was collected

Platform: Amazon Mechanical Turk (AMT)
Train set: 2 human judgments per triplet
Validation set: 5 human judgments per triplet
In total, judgments were collected on more than 161,000 patches

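Quality control: sentinels

Deliberately easy pairs (e.g. heavy vs. light Gaussian noise) are inserted
They are used to check each participant's reliability
Result: 90% of participants answered at least 93% of the sentinels correctly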
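
A sentinel check like this can be sketched as a per-worker accuracy filter. The log format, the function names, and the min_accuracy cutoff below are hypothetical; the review reports how well participants did on sentinels, not the exact rule used to reject work.

```python
def sentinel_accuracy(outcomes):
    """outcomes: list of booleans, one per sentinel trial (True = correct)."""
    if not outcomes:
        return None  # worker saw no sentinels, so reliability is unknown
    return sum(outcomes) / len(outcomes)

def reliable_workers(log, min_accuracy=0.9):
    """log: dict of worker id -> sentinel outcomes.

    min_accuracy is a hypothetical cutoff chosen for this sketch.
    """
    keep = []
    for worker_id, outcomes in log.items():
        accuracy = sentinel_accuracy(outcomes)
        if accuracy is not None and accuracy >= min_accuracy:
            keep.append(worker_id)
    return keep

# Made-up log: worker_a passes every sentinel, worker_b passes half.
log = {
    "worker_a": [True, True, True, True],
    "worker_b": [True, False, True, False],
}
print(reliable_workers(log))  # ['worker_a']
```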

Just noticeable differences (JND)

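Problem with 2AFC

The comparison criterion is subjective

JND

There is a correct answer (same/different), which reduces subjectivity
A good perceptual metric should order image pairs similarly to how often humans confuse them in the JND test
That is, pairs whose difference people cannot perceive should also get a near-zero distance under the metric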
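
One way to quantify "orders pairs similarly" is to rank the pairs by metric distance and compute average precision against the human "same" labels. This is a sketch under assumptions: the review does not spell out the JND score, and reducing each pair to a single boolean label (for example by majority vote) is a simplification made here.

```python
def average_precision(distances, is_same):
    """Rank pairs by ascending distance and score how early the pairs
    that humans called "same" appear in that ranking."""
    ranked = sorted(zip(distances, is_same), key=lambda item: item[0])
    hits = 0
    precisions = []
    for rank, (_, same) in enumerate(ranked, start=1):
        if same:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0

# Made-up example: the two closest pairs are the ones humans called "same".
print(average_precision([0.1, 0.2, 0.3, 0.4], [True, True, False, False]))  # 1.0
```

A metric that assigns small distances to pairs humans cannot tell apart scores close to 1; one that ranks them late scores lower.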

3. Deep Feature Spaces

Network architectures

Similarity is computed from the intermediate features of deep networks.

Networks

VGG
AlexNet
SqueezeNet
Self-Supervised models

Network activations to distance

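Computing the distance between a reference image and a distorted image

Select several convolution layers of the network
Extract the feature map at each layer
Unit-normalize each feature along the channel dimension
Apply the per-channel weights w_l (with w_l = 1 for all l, the L2 distance after channel-wise normalization is effectively equivalent to the cosine distance)
Compute the L2 distance, average spatially, and sum over channels and layers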
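
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the reference implementation: the feature maps are assumed to be already extracted and are represented as nested [H][W][C] lists, and the function names are made up for this example.

```python
import math

def unit_normalize(vec, eps=1e-10):
    """Scale a channel vector to unit length (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def layer_distance(feat_a, feat_b, weights):
    """feat_a, feat_b: feature maps as [H][W][C] lists; weights: length-C list."""
    total = 0.0
    positions = 0
    for row_a, row_b in zip(feat_a, feat_b):
        for vec_a, vec_b in zip(row_a, row_b):
            a, b = unit_normalize(vec_a), unit_normalize(vec_b)
            # Weighted squared L2 distance, summed over channels.
            total += sum((w * (p - q)) ** 2 for w, p, q in zip(weights, a, b))
            positions += 1
    return total / positions  # average over spatial positions

def deep_feature_distance(feats_a, feats_b, weights_per_layer):
    """Sum the per-layer distances over all selected layers."""
    return sum(
        layer_distance(fa, fb, w)
        for fa, fb, w in zip(feats_a, feats_b, weights_per_layer)
    )

# One layer, one spatial position, two channels, all weights set to 1.
orthogonal = deep_feature_distance([[[[1.0, 0.0]]]], [[[[0.0, 1.0]]]], [[1.0, 1.0]])
parallel = deep_feature_distance([[[[3.0, 4.0]]]], [[[[6.0, 8.0]]]], [[1.0, 1.0]])
print(orthogonal, parallel)
```

With all weights equal to 1 and unit-length vectors a and b, the squared L2 distance is 2 − 2·cos(a, b), so orthogonal features give roughly 2 and features pointing in the same direction give roughly 0 regardless of magnitude. That is the cosine-distance equivalence noted in the list.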

Training on our data

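Three training variants

lin: keep the pretrained network fixed and learn only the per-channel weights w
tune: initialize from the pretrained network and fine-tune all of its weights
scratch: start from a fully random initialization and train on the human judgments

=> All three are referred to as LPIPS

Training objective: given the predicted distances d(x, x0) and d(x, x1), train the model to match the human responses as closely as possible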
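
A simplified sketch of such an objective is shown below. It assumes the human response h is encoded as the fraction of annotators who judged x1 closer, and it maps the two distances to a predicted preference with a plain sigmoid on their difference. That sigmoid (and its scale parameter) is a stand-in chosen for this example; the paper's own mapping from distances to a prediction is not described in this review.

```python
import math

def predict_prefers_x1(d0, d1, scale=1.0):
    """Predicted probability that x1 is judged closer to the reference.

    A larger d0 relative to d1 means x1 is closer, so the probability rises.
    The sigmoid and `scale` are simplifications assumed for this sketch.
    """
    return 1.0 / (1.0 + math.exp(-scale * (d0 - d1)))

def cross_entropy(h, p, eps=1e-12):
    """Cross-entropy between the human response h and the prediction p."""
    return -(h * math.log(p + eps) + (1.0 - h) * math.log(1.0 - p + eps))

# Humans unanimously preferred x1 (h = 1). Distances that agree with them
# give a lower loss than distances that disagree.
loss_agree = cross_entropy(1.0, predict_prefers_x1(d0=0.9, d1=0.1))
loss_disagree = cross_entropy(1.0, predict_prefers_x1(d0=0.1, d1=0.9))
print(loss_agree < loss_disagree)  # True
```

Minimizing this loss pushes the distance of the patch humans preferred below the distance of the other patch, which is what "matching the human responses" amounts to.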

4. Experiments

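Evaluation: how well does the learned metric (LPIPS) agree with actual human judgments?

Agreement is measured against the choices of 5 human annotators
The metric earns credit equal to the fraction of annotators who chose the same side it did
Expected score of a human annotator: if a fraction p of annotators prefer one patch, a new annotator scores p² + (1 − p)² in expectation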
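
The scoring rule can be sketched as follows, assuming h is the fraction of annotators who judged x1 closer to the reference. The function names and the tie-handling choice (half credit when both distances are equal) are assumptions made for this example.

```python
def two_afc_score(d0, d1, h):
    """Credit earned by a metric on one triplet.

    The metric "chooses" whichever patch has the smaller distance and earns
    the fraction of annotators who made the same choice.
    """
    if d1 < d0:
        return h          # metric picked x1
    if d0 < d1:
        return 1.0 - h    # metric picked x0
    return 0.5            # tie: assumed half credit

def human_expected_score(h):
    """Expected credit of a new annotator who picks x1 with probability h."""
    return h * h + (1.0 - h) * (1.0 - h)

# 4 of 5 annotators chose x1, so h = 0.8.
print(two_afc_score(d0=0.4, d1=0.2, h=0.8))  # 0.8
print(human_expected_score(0.8))
```

A new annotator picks x1 with probability h and earns h, or picks x0 with probability 1 − h and earns 1 − h, which gives h² + (1 − h)². For h = 0.8 that is about 0.68, so a metric that always follows the majority can score above an individual human.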

4.1. Evaluations

How well do low-level metrics and classification networks perform?

Feature distances from classification networks outperform low-level metrics

Does the network have to be trained on classification?

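Must the network be trained on a classification task?
What about self-supervised or unsupervised training?
Networks trained in a self-supervised or unsupervised way also yield good perceptual features → a good training task produces a good representation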

Do metrics correlate across different perceptual tasks?

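If a metric does well on one perceptual judgment task (2AFC), does it also do well on another (JND)?
Measure the correlation between the metrics' rankings on 2AFC and their rankings on JND
The correlation is high: the 2AFC test alone is largely sufficient to assess general perceptual similarity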
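
Agreement between two rankings can be measured with Spearman's rank correlation, sketched below. Choosing Spearman is an assumption made here, since the review does not say which correlation coefficient was used, and this version assumes no tied scores.

```python
def ranks(values):
    """Rank positions (1 = smallest). Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))

# Made-up scores for three hypothetical metrics, not results from the paper.
scores_2afc = [0.60, 0.65, 0.70]
scores_jnd = [0.40, 0.50, 0.55]
print(spearman(scores_2afc, scores_jnd))  # 1.0: identical ordering
```

A value near 1 means the metrics are ordered the same way by both tests, which is the situation the review describes.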

Can we train a metric on traditional and CNN-based distortions?

Can a perceptual similarity metric be trained on traditional + CNN-based distortion data?
Fine-tuning the whole network (tune) performs best
Larger networks such as VGG perform better (higher capacity helps)

Does training on traditional and CNN-based distortions transfer to real-world scenarios?

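Does the metric generalize to the outputs of real image processing algorithms, not just to synthetic distortions?
The distortions used for training (traditional + CNN-based) match human judgments on real algorithm outputs well enough for the metric to transfer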

Where do deep metrics and low-level metrics disagree?

In which situations do deep metrics and low-level metrics disagree?
BiGAN is sensitive to blur but less sensitive to noise
SSIM is sensitive to noise but lenient toward blur
