Autoregressive Models
The idea of autoregression is to use past values to predict future ones. The formalization differs between time series analysis and deep learning approaches.
Signal Processing & Time Series Analysis
See https://en.wikipedia.org/wiki/Autoregressive_model.
In the field of signal processing, autoregressive models describe time-varying processes that are (weakly) stationary. A random variable \(X\) at time \(t\) is modelled as a weighted linear combination of its previous realizations and some noise:
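$$
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t
$$

with weights \(\varphi_1, \dots, \varphi_p\), a constant \(c\), and white noise \(\varepsilon_t\) (the standard AR(\(p\)) form from the linked article).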
ARMA Model
One common application of this formulation is in ARMA models, which combine the autoregressive part with a moving average for improved prediction:
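$$
X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
$$

where the first sum is the autoregressive (AR) part and the second sum is the moving average (MA) over past noise terms (standard ARMA(\(p, q\)) form).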
Deep Learning
See https://deepgenerativemodels.github.io/notes/autoregressive/
![[Pasted image 20240809134938.png]]
In deep learning, autoregression is formulated as the factorization of a joint distribution into a chain of conditional distributions:
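$$
p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}) = \prod_{i=1}^{n} p(x_i \mid \mathbf{x}_{<i})
$$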
This is typically done in sequence modeling (e.g. RNNs). If we consider a model over \(n\) binary variables where the conditionals are represented in tabular form, the \(n\)-th variable is conditioned on \(n-1\) preceding variables. That means there are \(2^{n-1}\) possible configurations of the conditioning variables, and we need to specify a conditional probability for each of them. The complexity of this naive representation is thus \(\mathcal{O}(2^{n})\).
Fully-Visible Sigmoid Belief Network
![[Pasted image 20240812105508.png]]
In the FVSBN, the conditional for the \(i\)-th variable is a parameterized linear combination of the \(i-1\) preceding variables, passed through a sigmoid:
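$$
\hat{x}_i = p(x_i = 1 \mid \mathbf{x}_{<i}) = \sigma\left( \alpha^{(i)}_0 + \sum_{j=1}^{i-1} \alpha^{(i)}_j x_j \right)
$$

where \(\sigma\) denotes the sigmoid function.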
Basically, the \(i\)-th conditional is a logistic regression on the prior observations. This brings the complexity down to \(\mathcal{O}(n^{2})\) (up to \(n\) weights for each of the \(n\) variables).
Neural Autoregressive Density Estimator

We can use the idea of FVSBNs, but parameterize the conditionals using a [[Multi-Layer Perceptron]]. This adds a hidden layer of \(d\) units for each of the \(n\) conditionals, slightly increasing the complexity to \(\mathcal{O}(n^{2}d)\) (up to \(n\) input weights for each of the \(n\times d\) hidden units). But if we constrain the input-to-hidden weights to be shared across all conditionals, we arrive at the NADE formulation:
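$$
\mathbf{h}_i = \sigma(W_{\cdot, <i}\, \mathbf{x}_{<i} + \mathbf{c}), \qquad \hat{x}_i = p(x_i = 1 \mid \mathbf{x}_{<i}) = \sigma\left( \boldsymbol{\alpha}^{(i)\top} \mathbf{h}_i + b_i \right)
$$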
where \(\theta=\{W\in \mathbb{R}^{d\times n}, \mathbf{c} \in \mathbb{R}^d, \{\boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d\}^n_{i=1}, \{b_i \in \mathbb{R}\}^n_{i=1}\}\) is the set of parameters. Since the input weights \(W\) are shared for all conditionals, the complexity is reduced to \(\mathcal{O}(nd)\).
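To make the weight sharing concrete, below is a minimal PyTorch sketch of the binary NADE conditionals (the class name, initialization scale, and batch/size numbers are illustrative assumptions, not from this note). The running pre-activation `a` is updated in \(\mathcal{O}(d)\) per step, which is what keeps the whole forward pass at \(\mathcal{O}(nd)\):

```python
import torch
import torch.nn as nn


class NADE(nn.Module):
    """Minimal NADE sketch for binary data: n visible units, d hidden units."""

    def __init__(self, n: int, d: int):
        super().__init__()
        self.n, self.d = n, d
        self.W = nn.Parameter(0.01 * torch.randn(d, n))      # shared input-to-hidden weights
        self.c = nn.Parameter(torch.zeros(d))                 # shared hidden bias
        self.alpha = nn.Parameter(0.01 * torch.randn(n, d))   # per-conditional output weights
        self.b = nn.Parameter(torch.zeros(n))                 # per-conditional output biases

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, n) binary inputs -> (batch, n) probabilities p(x_i = 1 | x_<i)."""
        batch = x.shape[0]
        # Running pre-activation a = c + W[:, :i] @ x_<i, reused across conditionals.
        a = self.c.unsqueeze(0).expand(batch, -1)
        probs = []
        for i in range(self.n):
            h = torch.sigmoid(a)                                 # h_i
            p_i = torch.sigmoid(h @ self.alpha[i] + self.b[i])   # p(x_i = 1 | x_<i)
            probs.append(p_i)
            a = a + torch.outer(x[:, i], self.W[:, i])           # add x_i's contribution for the next step
        return torch.stack(probs, dim=1)


# Usage sketch: average negative log-likelihood of a random binary batch.
model = NADE(n=784, d=500)
x = torch.bernoulli(torch.full((32, 784), 0.5))
p = model(x)
nll = nn.functional.binary_cross_entropy(p, x, reduction="sum") / x.shape[0]
```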
A further extension computes means and variances for each of the conditionals and models the \(i\)-th variable as a mixture of Gaussians ([[Gaussian Mixture Model]]). This allows for modelling real-valued data, which is why the extension is called RNADE (Real-valued NADE).
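Concretely, RNADE keeps the NADE hidden state \(\mathbf{h}_i\) but outputs the parameters of a \(K\)-component Gaussian mixture for each conditional (a sketch of the standard RNADE form; \(K\) and the parameter names are not from the note above):

$$
p(x_i \mid \mathbf{x}_{<i}) = \sum_{k=1}^{K} \pi_{i,k}\, \mathcal{N}\left( x_i;\ \mu_{i,k},\ \sigma^2_{i,k} \right)
$$

where the mixture weights \(\pi_{i,k}\), means \(\mu_{i,k}\), and variances \(\sigma^2_{i,k}\) are computed from \(\mathbf{h}_i\).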