Autoregressive Models
The idea of autoregression is to use past values to predict future ones. The formalization differs between time series analysis and deep learning approaches.
Signal Processing & Time Series Analysis
See https://en.wikipedia.org/wiki/Autoregressive_model.
In the field of signal processing, autoregressive models describe time-varying processes that are (weakly) stationary. A random variable \(X\) at time \(t\) is modelled as a weighted linear combination of its previous realizations and some noise:
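$$
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t
$$

with weights \(\varphi_1, \dots, \varphi_p\), a constant \(c\), and white noise \(\varepsilon_t\) (the standard AR(\(p\)) form from the linked article).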
ARMA Model
One common application of this formulation is in ARMA models, which combine the autoregressive part with a moving average for improved prediction:
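$$
X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
$$

where the first sum is the autoregressive (AR) part and the second sum is the moving average (MA) over past noise terms (standard ARMA(\(p, q\)) form).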
Deep Learning
See https://deepgenerativemodels.github.io/notes/autoregressive/
![[Pasted image 20240809134938.png]]
In deep learning, autoregression is formulated as the factorization of a joint distribution into a chain of conditional distributions:
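$$
p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}) = \prod_{i=1}^{n} p(x_i \mid \mathbf{x}_{<i})
$$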
This is typically done in sequence modeling (e.g. RNNs). If we consider a model over \(n\) binary variables where the conditionals are represented in tabular form, the \(n\)-th variable is conditioned on \(n-1\) preceding variables. That means there are \(2^{n-1}\) possible configurations of the conditioning variables, and we need to specify a conditional probability for each of them. The complexity of this naive representation is thus \(\mathcal{O}(2^{n})\).
Fully-Visible Sigmoid Belief Network
![[Pasted image 20240812105508.png]]
In the FVSBN, the conditional for the \(i\)-th variable is a parameterized linear combination of the \(i-1\) preceding variables, passed through a sigmoid:
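$$
\hat{x}_i = p(x_i = 1 \mid \mathbf{x}_{<i}) = \sigma\left( \alpha^{(i)}_0 + \sum_{j=1}^{i-1} \alpha^{(i)}_j x_j \right)
$$

where \(\sigma\) denotes the sigmoid function.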
Basically, the \(i\)-th conditional is a logistic regression on the prior observations. This brings the complexity down to \(\mathcal{O}(n^{2})\) (up to \(n\) weights for each of the \(n\) variables).
Neural Autoregressive Density Estimator

We can use the idea of FVSBNs, but parameterize the conditionals using a [[Multi-Layer Perceptron]]. This adds a hidden layer of \(d\) units for each of the \(n\) conditionals, slightly increasing the complexity to \(\mathcal{O}(n^{2}d)\) (up to \(n\) input weights for each of the \(n\times d\) hidden units). But if we constrain the input-to-hidden weights to be shared across all conditionals, we arrive at the NADE formulation:
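$$
\mathbf{h}_i = \sigma(W_{\cdot, <i}\, \mathbf{x}_{<i} + \mathbf{c}), \qquad \hat{x}_i = p(x_i = 1 \mid \mathbf{x}_{<i}) = \sigma\left( \boldsymbol{\alpha}^{(i)\top} \mathbf{h}_i + b_i \right)
$$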
where \(\theta=\{W\in \mathbb{R}^{d\times n}, \mathbf{c} \in \mathbb{R}^d, \{\boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d\}^n_{i=1}, \{b_i \in \mathbb{R}\}^n_{i=1}\}\) is the set of parameters. Since the input weights \(W\) are shared for all conditionals, the complexity is reduced to \(\mathcal{O}(nd)\).
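To make the weight sharing concrete, below is a minimal PyTorch sketch of the binary NADE conditionals (the class name, initialization scale, and batch/size numbers are illustrative assumptions, not from this note). The running pre-activation `a` is updated in \(\mathcal{O}(d)\) per step, which is what keeps the whole forward pass at \(\mathcal{O}(nd)\):

```python
import torch
import torch.nn as nn


class NADE(nn.Module):
    """Minimal NADE sketch for binary data: n visible units, d hidden units."""

    def __init__(self, n: int, d: int):
        super().__init__()
        self.n, self.d = n, d
        self.W = nn.Parameter(0.01 * torch.randn(d, n))      # shared input-to-hidden weights
        self.c = nn.Parameter(torch.zeros(d))                 # shared hidden bias
        self.alpha = nn.Parameter(0.01 * torch.randn(n, d))   # per-conditional output weights
        self.b = nn.Parameter(torch.zeros(n))                 # per-conditional output biases

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, n) binary inputs -> (batch, n) probabilities p(x_i = 1 | x_<i)."""
        batch = x.shape[0]
        # Running pre-activation a = c + W[:, :i] @ x_<i, reused across conditionals.
        a = self.c.unsqueeze(0).expand(batch, -1)
        probs = []
        for i in range(self.n):
            h = torch.sigmoid(a)                                 # h_i
            p_i = torch.sigmoid(h @ self.alpha[i] + self.b[i])   # p(x_i = 1 | x_<i)
            probs.append(p_i)
            a = a + torch.outer(x[:, i], self.W[:, i])           # add x_i's contribution for the next step
        return torch.stack(probs, dim=1)


# Usage sketch: average negative log-likelihood of a random binary batch.
model = NADE(n=784, d=500)
x = torch.bernoulli(torch.full((32, 784), 0.5))
p = model(x)
nll = nn.functional.binary_cross_entropy(p, x, reduction="sum") / x.shape[0]
```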
A further extension computes means and variances for each of the conditionals and models the \(i\)-th variable as a mixture of Gaussians ([[Gaussian Mixture Model]]). This allows for modelling real-valued data, which is why the extension is called RNADE (Real-valued NADE).
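Concretely, RNADE keeps the NADE hidden state \(\mathbf{h}_i\) but outputs the parameters of a \(K\)-component Gaussian mixture for each conditional (a sketch of the standard RNADE form; \(K\) and the parameter names are not from the note above):

$$
p(x_i \mid \mathbf{x}_{<i}) = \sum_{k=1}^{K} \pi_{i,k}\, \mathcal{N}\left( x_i;\ \mu_{i,k},\ \sigma^2_{i,k} \right)
$$

where the mixture weights \(\pi_{i,k}\), means \(\mu_{i,k}\), and variances \(\sigma^2_{i,k}\) are computed from \(\mathbf{h}_i\).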