Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song

Highlights

likelihood-based models

implicit generative models

require strong restrictions

ensure a tractable normalizing constant for likelihood computation

rely on surrogate objectives to approximate maximum likelihood training

require adversarial training, which is notoriously unstable

can lead to mode collapse

The key idea is to model the gradient of the log probability density function, a quantity often known as the (Stein) score function

not required to have a tractable normalizing constant

achieved state-of-the-art performance

allowing exact likelihood computation and representation learning

facilitates inverse problem solving

restrict their model architectures (e.g., causal convolutions in autoregressive models, invertible networks in normalizing flow models) to make Zθ tractable

approximate the normalizing constant (e.g., variational inference in VAEs, or MCMC sampling used in contrastive divergence) which may be computationally expensive

By modeling the score function instead of the density function, we can sidestep the difficulty of intractable normalizing constants

score-based model sθ(x) is independent of the normalizing constant Zθ
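
To make the fragment above concrete, in the post's energy-based notation $p_\theta(x) = e^{-f_\theta(x)}/Z_\theta$, taking the gradient of the log-density makes the normalizing constant vanish:

$$
\nabla_x \log p_\theta(x) = -\nabla_x f_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=0} = -\nabla_x f_\theta(x),
$$

so a score-based model $s_\theta(x) \approx \nabla_x \log p(x)$ can use a free-form architecture with no tractability constraint on $Z_\theta$.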

score matching, a family of methods that minimize the Fisher divergence without knowledge of the ground-truth data score; commonly used score matching methods include denoising score matching and sliced score matching

we can represent a distribution by modeling its score function, which can be estimated by training a score-based model of free-form architectures with score matching.
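
As a minimal sketch of what such training could look like, here is a PyTorch-style denoising score matching loss for a single noise level; the `score_model` interface, the choice of `sigma`, and the $\sigma^2$ weighting are illustrative assumptions, not code from the post:

```python
import torch

def dsm_loss(score_model, x, sigma=0.1):
    """Denoising score matching on one noise level.

    `score_model` maps a batch of perturbed inputs to scores of the same shape.
    """
    z = torch.randn_like(x)
    x_tilde = x + sigma * z
    # Score of the Gaussian perturbation kernel q(x_tilde | x): -(x_tilde - x) / sigma^2 = -z / sigma.
    target = -z / sigma
    pred = score_model(x_tilde)
    # lambda(sigma) = sigma^2 weighting keeps the loss on a comparable scale across noise levels.
    return 0.5 * (sigma ** 2) * ((pred - target) ** 2).flatten(1).sum(dim=1).mean()
```

The regression target comes from the perturbation kernel alone, so the ground-truth data score is never needed.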

the estimated score functions are inaccurate in low density regions

Since the ℓ2 differences between the true data score function and score-based model are weighted by p(x), they are largely ignored in low density regions where p(x) is small
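
Writing out the Fisher divergence used as the training objective makes this weighting explicit:

$$
\mathbb{E}_{p(x)}\big[\,\|\nabla_x \log p(x) - s_\theta(x)\|_2^2\,\big] = \int p(x)\,\|\nabla_x \log p(x) - s_\theta(x)\|_2^2\,\mathrm{d}x,
$$

so wherever $p(x) \approx 0$ the integrand contributes almost nothing, and the model receives essentially no training signal in those regions.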

derail

perturb data points with noise and train score-based models on the noisy data points instead

noise magnitude is sufficiently large, it can populate low data density regions to improve the accuracy of estimated scores

how do we choose an appropriate noise scale for the perturbation process?

we use multiple scales of noise perturbations simultaneously

noise-perturbed distribution

Noise Conditional Score-Based Model
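
A rough sketch of the corresponding training objective, i.e. denoising score matching averaged over several noise scales with the scale fed to the model as a conditioning input; the `score_model(x, sigma)` interface and the per-example scale sampling are assumptions:

```python
import torch

def ncsn_loss(score_model, x, sigmas):
    """Noise-conditional score matching: one randomly drawn noise scale per example.

    `sigmas` is a 1-D tensor of noise levels (e.g. a geometric sequence);
    `score_model(x_tilde, sigma)` returns a score with the same shape as `x_tilde`.
    """
    idx = torch.randint(len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))  # broadcast over non-batch dims
    z = torch.randn_like(x)
    x_tilde = x + sigma * z
    pred = score_model(x_tilde, sigma)
    # With lambda(sigma) = sigma^2, the weighted DSM loss reduces to || sigma * s_theta + z ||^2.
    return 0.5 * ((sigma * pred + z) ** 2).flatten(1).sum(dim=1).mean()
```

At sampling time the same model is queried with progressively smaller noise levels, e.g. via annealed Langevin dynamics.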

By generalizing the number of noise scales to infinity, we obtain not only higher quality samples, but also, among other benefits, exact log-likelihood computation and controllable generation for inverse problem solving.

perturb the data distribution with continuously growing levels of noise. In this case, the noise perturbation procedure is a continuous-time stochastic process, as demonstrated below
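
In the post's notation, the continuous-time perturbation is governed by a forward SDE, and sampling runs the corresponding reverse-time SDE, whose extra drift term depends only on the time-dependent score:

$$
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w,
\qquad
\mathrm{d}x = \big[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},
$$

where $\bar{w}$ is a reverse-time Brownian motion and time flows backward; substituting the trained score model $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ turns the reverse SDE into the generative process.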

adaptive step-size SDE solvers that can generate samples faster with better quality

two special properties of our reverse SDE

Predictor-Corrector samplers

first use the predictor to choose a proper step size

then predict x(t+Δt) based on the current sample x(t)

Next, we run several corrector steps to improve the sample x(t+Δt) according to our score-based model sθ(x,t+Δt),
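
A rough Predictor-Corrector loop could look like the sketch below: an Euler-Maruyama predictor on a fixed time grid plus a signal-to-noise-ratio Langevin corrector. Here `f`, `g`, the step-size heuristic, and the constants are illustrative assumptions rather than the exact algorithm from the paper:

```python
import torch

@torch.no_grad()
def pc_sampler(score_model, shape, f, g, n_steps=1000, n_corrector=1, snr=0.16, eps=1e-3):
    """Integrate the reverse SDE (predictor) and refine each step with Langevin MCMC (corrector).

    `f(x, t)` and `g(t)` are the drift and diffusion of the assumed forward SDE;
    `score_model(x, t)` approximates the time-dependent score of the perturbed data.
    """
    x = torch.randn(shape)                        # prior sample at t = 1
    ts = torch.linspace(1.0, eps, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                           # negative: integrating backward in time
        # Predictor: one reverse-time Euler-Maruyama step from t to t_next.
        score = score_model(x, t)
        drift = f(x, t) - g(t) ** 2 * score
        x = x + drift * dt + g(t) * torch.sqrt(-dt) * torch.randn_like(x)
        # Corrector: a few Langevin MCMC steps targeting the marginal at t_next.
        for _ in range(n_corrector):
            score = score_model(x, t_next)
            noise = torch.randn_like(x)
            step = 2 * (snr * noise.norm() / score.norm()) ** 2
            x = x + step * score + torch.sqrt(2 * step) * noise
    return x
```

The corrector needs only the score at the current noise level, which is exactly what the score-based model provides, so other score-based MCMC procedures could be substituted there.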

The corresponding ODE of an SDE is named probability flow ODE
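
As written in the post, the probability flow ODE shares the same marginal densities $p_t(x)$ as the SDE but is fully deterministic:

$$
\mathrm{d}x = \Big[f(x, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\Big]\,\mathrm{d}t,
$$

which is what enables exact log-likelihood computation through black-box ODE solvers and the change-of-variables formula.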

Score-based generative models are particularly suitable for solving inverse problems
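
The reason, as the post explains via Bayes' rule: since $\nabla_x \log p(y)$ is zero, the posterior score splits into the unconditional score plus a measurement-likelihood term,

$$
\nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x),
$$

where $\nabla_x \log p(y \mid x)$ is often known in closed form from the forward measurement process, and $\nabla_x \log p(x)$ is supplied by a single unconditional score-based model reused across inverse problems.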

three crucial improvements that can lead to extremely good samples: (1) perturbing data with multiple scales of noise, and training score-based models for each noise scale; (2) using a U-Net architecture (we used RefineNet since it is a modern version of U-Nets) for the score-based model; (3) applying Langevin MCMC to each noise scale and chaining them together.
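
Improvement (3) refers to annealed Langevin dynamics; at each noise scale $\sigma$, a few steps of the Langevin update

$$
x_{i+1} \leftarrow x_i + \epsilon\, s_\theta(x_i, \sigma) + \sqrt{2\epsilon}\, z_i, \qquad z_i \sim \mathcal{N}(0, I),
$$

are run, starting from the largest noise scale and using the final samples at one scale to initialize the chain at the next, smaller scale.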

Without awareness of this work, score-based generative modeling was proposed and motivated independently from a very different perspective. Despite both perturbing data with multiple scales of noise, the connection between score-based generative modeling and diffusion probabilistic modeling seemed superficial at that time, since the former is trained by score matching and sampled by Langevin dynamics, while the latter is trained by the evidence lower bound (ELBO) and sampled with a learned decoder

They showed that the ELBO used for training diffusion probabilistic models is essentially equivalent to the weighted combination of score matching objectives used in score-based generative modeling

by parameterizing the decoder as a sequence of score-based models with a U-Net architecture, they demonstrated for the first time that diffusion probabilistic models can also generate high quality image samples comparable or superior to GANs

Predictor-Corrector sampler

By generalizing the number of noise scales to infinity, we further proved that score-based generative models and diffusion probabilistic models can both be viewed as discretizations of stochastic differential equations determined by score functions.

The perspective of score matching and score-based models allows one to calculate log-likelihoods exactly and solve inverse problems naturally, and is directly connected to energy-based models

The perspective of diffusion models is naturally connected to VAEs, lossy compression, and can be directly incorporated with variational probabilistic inference
