Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song
Page Notes
Highlights
█ likelihood-based models
█ implicit generative models
█ require strong restrictions
█ ensure a tractable normalizing constant for likelihood computation
█ rely on surrogate objectives to approximate maximum likelihood training
█ require adversarial training, which is notoriously unstable
█ can lead to mode collapse
█ The key idea is to model the gradient of the log probability density function, a quantity often known as the (Stein) score function
█ not required to have a tractable normalizing constant
█ achieved state-of-the-art performance
█ allowing exact likelihood computation and representation learning
█ facilitates inverse problem solving
█ restrict their model architectures (e.g., causal convolutions in autoregressive models, invertible networks in normalizing flow models) to make Zθ tractable
█ approximate the normalizing constant (e.g., variational inference in VAEs, or MCMC sampling used in contrastive divergence) which may be computationally expensive
█ By modeling the score function instead of the density function, we can sidestep the difficulty of intractable normalizing constants
█ score-based model sθ(x) is independent of the normalizing constant Zθ
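In the post's notation, with an energy-based parameterization $p_\theta(\mathbf{x}) = e^{-f_\theta(\mathbf{x})} / Z_\theta$, this is immediate: taking the gradient of the log-density makes the constant vanish,

$$
s_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}} f_\theta(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}} \log Z_\theta}_{=0} = -\nabla_{\mathbf{x}} f_\theta(\mathbf{x}).
$$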
█ score matching: a family of methods that minimize the Fisher divergence without knowledge of the ground-truth data score. Commonly used score matching methods include denoising score matching and sliced score matching.
█ we can represent a distribution by modeling its score function, which can be estimated by training a score-based model of free-form architectures with score matching.
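A minimal sketch of one such objective, denoising score matching, as it might look in PyTorch; `score_model` and the fixed noise level `sigma` are illustrative placeholders, not code from the post:

```python
import torch

def denoising_score_matching_loss(score_model, x, sigma=0.1):
    """Regress the model onto the score of a Gaussian perturbation kernel,
    which requires no access to the ground-truth data score."""
    z = torch.randn_like(x)
    x_tilde = x + sigma * z          # perturb data with Gaussian noise
    target = -z / sigma              # = grad_{x_tilde} log q(x_tilde | x)
    pred = score_model(x_tilde)      # model's estimate of the perturbed score
    return 0.5 * ((pred - target) ** 2).flatten(1).sum(dim=-1).mean()
```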
█ the estimated score functions are inaccurate in low density regions
█ Since the ℓ2 differences between the true data score function and score-based model are weighted by p(x), they are largely ignored in low density regions where p(x) is small
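Writing the Fisher divergence out as an integral makes that weighting explicit:

$$
\mathbb{E}_{p(\mathbf{x})}\big[\|\nabla_{\mathbf{x}} \log p(\mathbf{x}) - s_\theta(\mathbf{x})\|_2^2\big] = \int p(\mathbf{x})\,\|\nabla_{\mathbf{x}} \log p(\mathbf{x}) - s_\theta(\mathbf{x})\|_2^2\,\mathrm{d}\mathbf{x},
$$

so regions where $p(\mathbf{x}) \approx 0$ contribute almost nothing to the training objective.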
█ derail
█ perturb data points with noise and train score-based models on the noisy data points instead
█ when the noise magnitude is sufficiently large, it can populate low data density regions to improve the accuracy of estimated scores
█ how do we choose an appropriate noise scale for the perturbation process?
█ we use multiple scales of noise perturbations simultaneously
█ noise-perturbed distribution
█ Noise Conditional Score-Based Model
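A hedged sketch of the corresponding training objective: a weighted sum of denoising score matching losses, one per noise scale, with the common weighting $\lambda(\sigma_i) = \sigma_i^2$; the noise-conditional `score_model(x, sigma)` interface and the geometric `sigmas` schedule are illustrative assumptions:

```python
import torch

def noise_conditional_score_loss(score_model, x, sigmas):
    """Weighted sum of denoising score matching losses over noise scales."""
    loss = 0.0
    for sigma in sigmas:
        z = torch.randn_like(x)
        x_tilde = x + sigma * z
        target = -z / sigma                  # score of the perturbation kernel
        pred = score_model(x_tilde, sigma)   # noise-conditional score estimate
        loss = loss + (sigma ** 2) * 0.5 * ((pred - target) ** 2).flatten(1).sum(-1).mean()
    return loss / len(sigmas)

# Illustrative geometric sequence of noise levels, from large to small.
sigmas = torch.logspace(0, -2, steps=10)     # 1.0 down to 0.01
```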
█ By generalizing the number of noise scales to infinity, we obtain not only higher quality samples, but also, among others, exact log-likelihood computation and controllable generation for inverse problem solving.
█ perturb the data distribution with continuously growing levels of noise. In this case, the noise perturbation procedure is a continuous-time stochastic process, as demonstrated below
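In the post's notation, this continuous-time perturbation is described by a forward SDE

$$
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w},
$$

where $\mathbf{f}(\cdot, t)$ is a drift coefficient, $g(t)$ a diffusion coefficient, and $\mathbf{w}$ a standard Wiener process.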
█ adaptive step-size SDE solvers that can generate samples faster with better quality
█ two special properties of our reverse SDE
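For reference, the reverse-time SDE that the samplers discretize is

$$
\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}},
$$

where $\bar{\mathbf{w}}$ is a Wiener process running backwards in time; solving it from $t = T$ to $t = 0$ requires only a sample from the terminal distribution $p_T(\mathbf{x})$ and the score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, which the time-conditional model $s_\theta(\mathbf{x}, t)$ estimates.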
█ Predictor-Corrector samplers
█ first use the predictor to choose a proper step size
█ then predict x(t+Δt) based on the current sample x(t)
█ Next, we run several corrector steps to improve the sample x(t+Δt) according to our score-based model sθ(x, t+Δt).
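A minimal sketch of one such Predictor-Corrector step under simplifying assumptions (a drift-free, variance-exploding forward SDE $\mathrm{d}\mathbf{x} = g(t)\,\mathrm{d}\mathbf{w}$, Euler-Maruyama as the predictor, Langevin MCMC as the corrector); `score_model`, `g`, and the step-size rules are placeholders, not the post's implementation:

```python
import torch

def pc_step(score_model, x, t, dt, g, n_corrector=1, snr=0.16):
    """One Predictor-Corrector step of reverse-time sampling (dt < 0)."""
    # Predictor: Euler-Maruyama discretization of the reverse SDE
    # dx = -g(t)^2 * score dt + g(t) dw_bar  (drift-free forward SDE assumed).
    score = score_model(x, t)
    z = torch.randn_like(x)
    x = x - (g(t) ** 2) * score * dt + g(t) * abs(dt) ** 0.5 * z
    t = t + dt
    # Corrector: a few Langevin MCMC steps targeting p_t at the new time.
    for _ in range(n_corrector):
        score = score_model(x, t)
        z = torch.randn_like(x)
        eps = 2 * (snr * z.flatten(1).norm(dim=-1).mean()
                   / score.flatten(1).norm(dim=-1).mean()) ** 2
        x = x + eps * score + (2 * eps) ** 0.5 * z
    return x, t
```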
█ The corresponding ODE of an SDE is named probability flow ODE
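Concretely, the probability flow ODE sharing the same marginal distributions $p_t(\mathbf{x})$ as the SDE is

$$
\mathrm{d}\mathbf{x} = \Big[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2} g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\Big]\,\mathrm{d}t.
$$

Being deterministic, it can be treated like a continuous normalizing flow, which is what enables exact log-likelihood computation.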
█ Score-based generative models are particularly suitable for solving inverse problems
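The underlying reason is that the posterior score needed for conditional sampling decomposes via Bayes' rule with no normalizing constant:

$$
\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) + \nabla_{\mathbf{x}} \log p(\mathbf{y} \mid \mathbf{x}),
$$

so an unconditional score model can be reused together with a known or learned measurement term $\nabla_{\mathbf{x}} \log p(\mathbf{y} \mid \mathbf{x})$.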
█ three crucial improvements that can lead to extremely good samples: (1) perturbing data with multiple scales of noise, and training score-based models for each noise scale; (2) using a U-Net architecture (we used RefineNet since it is a modern version of U-Nets) for the score-based model; (3) applying Langevin MCMC to each noise scale and chaining them together.
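A minimal sketch of improvement (3), annealed Langevin dynamics over a decreasing sequence of noise scales; `score_model`, the initialization, the step-size rule, and the iteration counts are illustrative assumptions rather than the post's code:

```python
import torch

@torch.no_grad()
def annealed_langevin_sampling(score_model, shape, sigmas, n_steps=100, base_eps=2e-5):
    """Run Langevin MCMC at each noise scale, from largest to smallest,
    using the final sample of one scale to initialize the next."""
    x = torch.rand(shape)                              # simple initial distribution
    for sigma in sigmas:                               # sigmas sorted large -> small
        eps = base_eps * (sigma / sigmas[-1]) ** 2     # scale-dependent step size
        for _ in range(n_steps):
            z = torch.randn_like(x)
            score = score_model(x, sigma)              # noise-conditional score
            x = x + 0.5 * eps * score + eps ** 0.5 * z
    return x
```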
█ Without awareness of this work, score-based generative modeling was proposed and motivated independently from a very different perspective. Although both perturb data with multiple scales of noise, the connection between score-based generative modeling and diffusion probabilistic modeling seemed superficial at that time, since the former is trained by score matching and sampled by Langevin dynamics, while the latter is trained by the evidence lower bound (ELBO) and sampled with a learned decoder
█ They showed that the ELBO used for training diffusion probabilistic models is essentially equivalent to the weighted combination of score matching objectives used in score-based generative modeling
█ by parameterizing the decoder as a sequence of score-based models with a U-Net architecture, they demonstrated for the first time that diffusion probabilistic models can also generate high quality image samples comparable or superior to GANs
█ Predictor-Corrector sampler
█ By generalizing the number of noise scales to infinity, we further proved that score-based generative models and diffusion probabilistic models can both be viewed as discretizations to stochastic differential equations determined by score functions.
█ The perspective of score matching and score-based models allows one to calculate log-likelihoods exactly and solve inverse problems naturally, and it is directly connected to energy-based models
█ The perspective of diffusion models is naturally connected to VAEs and lossy compression, and can be directly incorporated with variational probabilistic inference