You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Suppose we have natural data laying in ${(x, 0, 0, 0, 0) : x \in \mathbb{R}} \subseteq{R}^5$, this forms a 1 dimensional manifold (due to it being a subvectorspace). If there is noise applied to for example the natural data $(1, 0, 0, 0, 0)$ we might get something like $(1.1, -0.1, 0.2, 0.3, -0.2) \in \mathbb{R}^5$ and a DAE would for example try to learn to squish all coordinates that are not the first one to 0.
In this example I trained a simple 5 -> 16 -> 16 -> 5 MLP with ReLU activation for the denoising.
Suppose we have a sample point (in this example $z_t = (x_t, y_t) \in \mathbb{R}^2$). We sample
a pre-defined proposal distribution. For the basic Metropolis algorithm, it needs to be symmetric,
in the sense that $p(x|y) = p(y|x)$ holds. Isotropic multivariate Gaussian $\mathcal{N}(0, \sigma^2 I)$ and uniform distribution $U((-a, a) \times (-a, a))$
both satisfy this condition. In this example, we use a Gaussian distribution as the proposal function.
We sample a direction $\delta \sim \mathcal{N}(0, \sigma^2 I)$, and for our target function
(an unnormalised distribution) $\tilde{p}$, we compute the (possibly overflowing) acceptance value
of this step as follows: $\tilde{A}(z_t, z_t + \delta) := \frac{p(z_t + \delta)}{p(z_t)}$.
As we want an acceptance probability, we clip it to be at most 1:
$A(z_t, z_t + \delta) = \min(1, \tilde{A}(z_t, z_t + \delta)) = \min\left(1, \frac{p(z_t + \delta)}{p(z_t)}\right)$.
We then accept this step with probability $A(z_t, z_t + \delta)$, which is done by sampling
$u \sim U(0, 1)$ and checking if $u \leq A(z_t, z_t + \delta)$ (which is always true if $\tilde{A}(z_t, z_t + \delta) \geq 1.0$).
If we accept, we define $z_{t + 1} = z_t + \delta$; if we reject, we simply discard the
step and remain at the same position, i.e., $z_{t + 1} = z_t$.
Rejection Sampling
Suppose we want to sample a complex distribution $p(z)$ what is easy to evaluate if we ignore
normalisation, i.e., such that $\tilde{p}(z) = C p(z), C > 0$ is easy to evaluate. Take some simple to evaluate distribution $q(z)$ (for example normal)
and find $k > 0$ such that $kq(z) > \tilde{p}(z)$
for all $z$. Then do uniform sampling on $\xi \sim U(0, kq(z_0))$ and accept if $\xi \leq \tilde{p}(z)$,
reject otherwise. The acceptance rate is proportional to $\frac{1}{k}$ so we want to have k
as small as possible.
Ridge Regression
Here I implemented the discussion in this stackexchange answer https://stats.stackexchange.com/a/151351
which shows the "Ridge" that we get with overfitting explicitly
Curve Fitting
Assuming we are fitting a curve with Gaussian Noise, then for any fixed x value the distribution
of the target value is given as follows (based on Figure 1.28 in Bishop, Pattern Recognition and Machine Learning, 2006)
PCA
We take our samples $X = {x_1, \dots, x_N} \subseteq \mathbb{R}^D$ and construct the covariance matrix $\Sigma_X = \sum_{i = 1}^N x_i^\top x_i$.
This is a symmetric positive definite matrix (assuming all samples are linearly independent) as such
it has $N$ eigenvectors $v_1, \dots, v_N$ and $n$ positive (!) eigenvalues $\lambda_1, \dots, \lambda_N$,
without loss of generality we can assume that $\lambda_1 \geq \lambda_2 \geq \dots \lambda_N$.
The first $k$ eigenvectors then give us the most important to extract the most important components
Effect of regularisation
Suppose we have a covariance matrix $\Sigma \in \mathbb{R}^{N \times N}$, we can regularise by
making the diagonal more dominant, applying so-called Tychonoff (or Tikhonov) Regularisation: $\Sigma_\alpha = \alpha I_{N \times N} + \Sigma$
The effect of this is it makes the underlying equicontours more spherical, in the sense that
the eigenvalues are closer together ($\lambda_i / \lambda_{i + 1}$ goes closer to 1)
Box Muller
A somewhat efficient algorithm for sample standard normal random variable using a transform on
uniforms distributed variables. Suppose we want to sample $2N$ standard normal variables.
First start with $X^{(0)} := {(x_i, y_i) : 1 \leq i \leq N}$ random variables such that $x_i, y_i \sim U(0, 1)$
and reject all samples where $r_i^2 := x_i^2 + y_i^2 > 1.0,$
giving us $X^{(1)} = {(x_i, y_i) \in X : r_i^2 \leq 1.0}.$
We then apply the transform $s_i^{(1)} = x_i \sqrt{\frac{-2\log(r_i^2)}{r_i^2}}$ and $s_i^{(2)} = y_i \sqrt{\frac{-2\log(r_i^2)}{r_i^2}}$ it can be shown that $s_i^{(1)}$ and $s_i^{(1)}$ are iid with $s_i^{(j)} \sim \mathcal{N}(0, 1)$.
This uses the rotational symmetry of the 2 dimensional normal distribution. We can see more accurately
what the transformation does by highlighting a particular slice
Integral Density
When doing Bayesian inference, if $p(\hat{\theta}|x)$ is very close to 1 for a particular $\hat{\theta}$
and close to $0$ for all others then $p(x|X) \approx p(X|\hat{\theta})$
This visualises this phenomonon
Gaussian Process
Normal
With added noise
Contribution Histograms
Heatmap of different Kernels
References
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
About
Implementation of different deep learning specific things