[2311.12908] Diffusion Model Alignment Using Direct Preference Optimization
https://arxiv.org/abs/2311.12908
Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models.
Problem
LLMs: two-stage training (pretraining, then an alignment stage on human preference data)
Text-to-image diffusion models: no alignment stage
Proposal: Diffusion-DPO, adapting Direct Preference Optimization to diffusion models
Alignment in LLMs
RLHF: fit a reward model on human comparison data, then fine-tune the policy with reinforcement learning against that reward
but: the RL stage is complex and unstable, and the separate reward model adds cost
Alternative: DPO, which rewrites the reward in terms of the policy itself and trains directly on preference pairs, with no explicit reward model and no RL loop
Alignment attempts in diffusion models: prior work mostly applies RL against a reward model (e.g., DDPO, DPOK), which only works on small, closed prompt sets
Comparison: Diffusion-DPO needs neither a reward model nor RL and trains directly on open-vocabulary human preference pairs (Pick-a-Pic)

3.2. Direct Preference Optimization
Reward Modeling (this stage trains an explicit reward model!)
Human preferences between two generations for the same prompt c are modeled with the Bradley-Terry (BT) model: the probability that the preferred sample x_0^w beats the dispreferred x_0^l is a sigmoid of their reward difference. The reward r_phi is then fit by maximum likelihood on the comparison data.
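Reconstructing the equations that belong under this heading (the standard Bradley-Terry formulation the paper uses):

$$p_{\mathrm{BT}}(x_0^w \succ x_0^l \mid c) = \sigma\big(r(c, x_0^w) - r(c, x_0^l)\big)$$

$$L_{\mathrm{BT}}(\phi) = -\mathbb{E}_{c,\, x_0^w,\, x_0^l}\Big[\log \sigma\big(r_\phi(c, x_0^w) - r_\phi(c, x_0^l)\big)\Big]$$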
RLHF
With the learned reward fixed, RLHF maximizes the expected reward of the generator while a KL penalty keeps it close to the reference (pretrained) model.
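The standard KL-regularized objective this refers to (a reconstruction, writing the generator as p_theta and the reference model as p_ref):

$$\max_{p_\theta}\; \mathbb{E}_{c \sim \mathcal{D}_c,\; x_0 \sim p_\theta(x_0 \mid c)}\big[r(c, x_0)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[p_\theta(x_0 \mid c)\,\|\,p_{\mathrm{ref}}(x_0 \mid c)\big]$$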
DPO Objective
The KL-constrained problem above has a closed-form optimal policy, so the reward can be rewritten as a log-ratio between the policy and the reference. Substituting this implicit reward back into the BT loss eliminates both the explicit reward model and the RL loop, leaving a single maximum-likelihood objective over preference pairs.
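A sketch of the derivation (standard DPO algebra; Z(c) is the partition function, which cancels in the pairwise loss):

$$p^*_\theta(x_0 \mid c) = \frac{p_{\mathrm{ref}}(x_0 \mid c)\,\exp\big(r(c, x_0)/\beta\big)}{Z(c)} \;\Longrightarrow\; r(c, x_0) = \beta \log \frac{p^*_\theta(x_0 \mid c)}{p_{\mathrm{ref}}(x_0 \mid c)} + \beta \log Z(c)$$

$$L_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(c,\, x_0^w,\, x_0^l)}\Big[\log \sigma\Big(\beta \log \frac{p_\theta(x_0^w \mid c)}{p_{\mathrm{ref}}(x_0^w \mid c)} - \beta \log \frac{p_\theta(x_0^l \mid c)}{p_{\mathrm{ref}}(x_0^l \mid c)}\Big)\Big]$$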
Summary
Problem: DPO cannot be applied to diffusion models as-is, because the likelihood p_theta(x_0 | c) is intractable; it requires marginalizing over every possible denoising path x_{1:T}.
Solution: ELBO. Define the reward over the whole chain x_{0:T}, push the expectation over paths out of the log-sigmoid via Jensen's inequality, and approximate sampling from the reverse process with the forward process q. The intractable likelihood ratios then reduce to per-timestep noise-prediction errors.
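The key bounding step, sketched from the derivation above (it uses the convexity of -log sigma; x_{0:T}^w and x_{0:T}^l are full diffusion paths for the preferred and dispreferred images):

$$L(\theta) \;\le\; -\mathbb{E}_{(x_0^w,\, x_0^l),\; x_{1:T}^w,\; x_{1:T}^l}\Big[\log\sigma\Big(\beta \log \frac{p_\theta(x_{0:T}^w)}{p_{\mathrm{ref}}(x_{0:T}^w)} - \beta \log \frac{p_\theta(x_{0:T}^l)}{p_{\mathrm{ref}}(x_{0:T}^l)}\Big)\Big]$$

Each path log-ratio factorizes over timesteps, and with x_t drawn from the forward process each per-step term becomes a difference of denoising MSEs.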
Final: the Diffusion-DPO loss, shown below.
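A reconstruction of the final per-timestep objective (T is the number of diffusion steps, omega(lambda_t) a timestep weighting, epsilon_theta / epsilon_ref the trainable and frozen reference noise predictors):

$$L(\theta) = -\mathbb{E}\,\log\sigma\Big(\!-\beta T \omega(\lambda_t)\Big[\big(\|\epsilon^w - \epsilon_\theta(x_t^w, t)\|_2^2 - \|\epsilon^w - \epsilon_{\mathrm{ref}}(x_t^w, t)\|_2^2\big) - \big(\|\epsilon^l - \epsilon_\theta(x_t^l, t)\|_2^2 - \|\epsilon^l - \epsilon_{\mathrm{ref}}(x_t^l, t)\|_2^2\big)\Big]\Big)$$

with the expectation over pairs (x_0^w, x_0^l), t ~ U(0, T), and x_t^* ~ q(x_t^* | x_0^*). Intuitively, the model is rewarded for improving its denoising of the preferred image, relative to the reference model, more than its denoising of the dispreferred one.

A minimal PyTorch sketch of this loss. The model call signature, tensor layout, and helper names are assumptions for illustration, not the paper's code; as in common implementations, the constant T * omega(lambda_t) is folded into the single coefficient beta:

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, cond, x_t_w, x_t_l, t, eps_w, eps_l,
                       beta=5000.0):
    """x_t_w / x_t_l: noised latents of the preferred / dispreferred images at
    timestep t; eps_w / eps_l: the Gaussian noise used to produce them."""
    # Per-sample squared denoising error, averaged over non-batch dims.
    def err(eps, pred):
        return ((eps - pred) ** 2).mean(dim=tuple(range(1, eps.ndim)))

    # Trainable model's error on both branches.
    w_model = err(eps_w, model(x_t_w, t, cond))
    l_model = err(eps_l, model(x_t_l, t, cond))

    # Frozen reference model's error (no gradients).
    with torch.no_grad():
        w_ref = err(eps_w, ref_model(x_t_w, t, cond))
        l_ref = err(eps_l, ref_model(x_t_l, t, cond))

    # Improve on the winner more than on the loser, relative to the reference.
    logits = -beta * ((w_model - w_ref) - (l_model - l_ref))
    return -F.logsigmoid(logits).mean()
```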