# Deriving the Kalman Filter

October 05, 2020
Mark Liu

## Intro

Let’s use the random vector $x$ to represent an uncertain state, and the random vector $z$ to represent an uncertain measurement. Even before making any actual measurements, we should have a prior idea of the likelihoods of different values of the combined vector $\begin{bmatrix} x\\ z \end{bmatrix}$. These are subjective assessments of the following sort.

• The value of $x$ is probably close to the known vector $a$
• The value of $z$ will probably turn out to be close to the known vector $b$
• The value of $z$ is probably close to $H x$, where $H$ is a known matrix

To encode these prior subjective beliefs numerically, we can say that $\begin{bmatrix} x\\ z \end{bmatrix}$ is distributed as a Gaussian random variable.

$\begin{bmatrix} x\\ z \end{bmatrix} \sim N(\begin{bmatrix} \mu_x\\ \mu_z \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xz}\\ \Sigma_{zx} & \Sigma_{zz} \end{bmatrix})$

• $\Sigma_{xx}$ describes how close we believe $x$ is to $\mu_x$.
• $\Sigma_{zz}$ describes how close we believe $z$ will be to $\mu_z$.
• $\Sigma_{xz} = \Sigma_{zx}^T$ describes how correlated we think $z$ and $x$ are.

The Kalman Filter can be viewed as a principled way to choose $\mu_x, \, \mu_z, \, \Sigma_{xx}, \, \Sigma_{xz}, \, \Sigma_{zx}, \, \Sigma_{zz}$. There are other ways to choose these priors, but suppose for now that we have chosen them sensibly.

Now that we have a prior $p(x,z)$, we can incorporate any measurement $z_0$ into the state estimate by simply taking the posterior $p(x|z=z_0)$. We will find in the next section that the posterior $x|z=z_0$ is distributed as a Gaussian with the following parameters.

$\mu_{x|z=z_0} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$ $\Sigma_{x|z=z_0} =\Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$

I will call these the Bayes inference equations.
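The Bayes inference equations are easy to express in code. Here is a minimal sketch (assuming NumPy; the function and variable names are my own, not from any library):

```python
import numpy as np

def condition_gaussian(mu_x, mu_z, S_xx, S_xz, S_zz, z0):
    """Posterior mean and covariance of x given z = z0,
    per the Bayes inference equations."""
    K = S_xz @ np.linalg.inv(S_zz)   # gain matrix Sigma_xz Sigma_zz^{-1}
    mu_post = mu_x + K @ (z0 - mu_z)
    S_post = S_xx - K @ S_xz.T       # S_xz.T plays the role of Sigma_zx
    return mu_post, S_post
```

Here `S_xz.T` stands in for $\Sigma_{zx}$, using the symmetry $\Sigma_{zx} = \Sigma_{xz}^T$.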

## Deriving the Bayes Inference Equations

In this section I’ll derive the Bayes inference equations. $\mu_{x|z=z_0} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$ $\Sigma_{x|z=z_0} =\Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$

Feel free to skip ahead and come back to this section later if you’re willing to take these equations on faith for now.

We have the proportionality relationship $p(x|z=z_0) = \frac {p(x, z_0)} {p(z_0)}\propto p(x,z_0)$. This means we only have to evaluate the right hand side $p(x,z_0)$ in order to know the distribution $p(x|z=z_0)$.

Remember the Gaussian density, where $K$ represents a constant of integration that we don’t care about. $p(x,z) = K\exp(-\frac{1}{2}\begin{bmatrix} x - \mu_x\\ z - \mu_z\end{bmatrix} ^T\begin{bmatrix} \Sigma_{xx} & \Sigma_{xz}\\ \Sigma_{zx} & \Sigma_{zz} \end{bmatrix}^{-1}\begin{bmatrix} x - \mu_x\\ z - \mu_z \end{bmatrix})$

It will be convenient to use the inverse covariance matrix, also known as the information matrix. $\begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix}\equiv \begin{bmatrix} \Sigma_{xx} & \Sigma_{xz}\\ \Sigma_{zx} & \Sigma_{zz} \end{bmatrix}^{-1}$

We can substitute the information matrix and expand, absorbing the term $-\frac{1}{2}\begin{bmatrix} \mu_x\\ \mu_z \end{bmatrix}^T\begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix}\begin{bmatrix} \mu_x\\ \mu_z \end{bmatrix}$, which depends on neither $x$ nor $z$, into the constant $K$. $p(x,z) = K\exp(-\frac{1}{2}\begin{bmatrix} x\\ z \end{bmatrix}^T\begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix}\begin{bmatrix} x\\ z \end{bmatrix} + \begin{bmatrix} x\\ z \end{bmatrix}^T \begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix}\begin{bmatrix} \mu_x\\ \mu_z \end{bmatrix} )$

Then substitute $z = z_0$ and expand further. We can collect any terms that are not multiplied by $x$ into a constant $C$.
$p(x,z_0) = K\exp(-\frac{1}{2}x^T\Lambda_{xx}x - x^T\Lambda_{xz}z_0 + x^T \Lambda_{xx} \mu_x + x^T\Lambda_{xz}\mu_z + C)$

The $C$ and $K$ both drop out as scaling constants.
$p(x,z_0) \propto \exp(-\frac{1}{2}x^T\Lambda_{xx}x - x^T\Lambda_{xz}z_0 + x^T \Lambda_{xx} \mu_x + x^T\Lambda_{xz}\mu_z)$
$p(x,z_0) \propto \exp(-\frac{1}{2}x^T\Lambda_{xx}x + x^T(\Lambda_{xx} \mu_x - \Lambda_{xz}(z_0 - \mu_z)))$

Complete the square, first rewriting $\Lambda_{xx} \mu_x - \Lambda_{xz}(z_0 - \mu_z) \rightarrow \Lambda_{xx} (\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z))$
$p(x,z_0) \propto \exp(-\frac{1}{2}x^T\Lambda_{xx}x + x^T\Lambda_{xx} (\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z)))$
$p(x,z_0) \propto \exp(-\frac{1}{2} (x - (\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z)))^T\Lambda_{xx}(x - (\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z))))$

Note that this is the probability density of a Gaussian with mean $\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z)$ and covariance $\Lambda_{xx}^{-1}$. $\mu_{x|z=z_0} = \mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z)$ $\Sigma_{x|z=z_0} = \Lambda_{xx}^{-1}$

This formula is written in terms of the information matrix, but in many cases it is more convenient to write it in terms of the covariance matrix. To accomplish this, we can use the block-matrix inversion formula where $\Sigma/\Sigma_{zz}$ is the Schur complement $\Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$.

$\begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix} = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xz}\\ \Sigma_{zx} & \Sigma_{zz} \end{bmatrix}^{-1} = \begin{bmatrix} (\Sigma/\Sigma_{zz})^{-1} & - (\Sigma/\Sigma_{zz})^{-1}\Sigma_{xz}\Sigma_{zz}^{-1} \\ \Sigma_{zz}^{-1}\Sigma_{zx}(\Sigma/\Sigma_{zz})^{-1} & \Sigma_{zz}^{-1} + \Sigma_{zz}^{-1}\Sigma_{zx} (\Sigma/\Sigma_{zz})^{-1}\Sigma_{xz}\Sigma_{zz}^{-1}\end{bmatrix}$

We see that $\Lambda_{xx}^{-1} = \Sigma/\Sigma_{zz}$ and $-\Lambda_{xx}^{-1} \Lambda_{xz} = \Sigma_{xz}\Sigma_{zz}^{-1}$. Therefore we can write the distribution of $x|_{z=z_0}$ in terms of the covariance matrix. $\mu_{x|z=z_0} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$ $\Sigma_{x|z=z_0} =\Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$
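These block-inverse identities are easy to sanity-check numerically. The following sketch (assuming NumPy; the block sizes are arbitrary illustrative choices) builds a random joint covariance and verifies both equalities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive-definite joint covariance.
n, m = 3, 2
A = rng.standard_normal((n + m, n + m))
Sigma = A @ A.T + (n + m) * np.eye(n + m)
S_xx, S_xz = Sigma[:n, :n], Sigma[:n, n:]
S_zx, S_zz = Sigma[n:, :n], Sigma[n:, n:]

# Information matrix and its x-blocks.
Lam = np.linalg.inv(Sigma)
L_xx, L_xz = Lam[:n, :n], Lam[:n, n:]

# Schur complement Sigma / Sigma_zz.
schur = S_xx - S_xz @ np.linalg.inv(S_zz) @ S_zx

assert np.allclose(np.linalg.inv(L_xx), schur)   # Lam_xx^{-1} = Sigma/Sigma_zz
assert np.allclose(-np.linalg.inv(L_xx) @ L_xz,
                   S_xz @ np.linalg.inv(S_zz))   # -Lam_xx^{-1} Lam_xz = Sigma_xz Sigma_zz^{-1}
```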

## Deriving the Kalman Filter

In the first section, I mentioned that the Kalman Filter can be seen as a principled way to establish the priors $\mu_x, \, \mu_z, \, \Sigma_{xx}, \, \Sigma_{zz},\, \Sigma_{xz}$.

Remember we wanted these priors so that, given an actual measurement $z_0$, we could apply the Bayes inference equations. $\mu_{x|z=z_0} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$ $\Sigma_{x|z=z_0} =\Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$

The Kalman Filter sets up $\mu_x, \, \mu_z, \, \Sigma_{xx}, \, \Sigma_{zz},\, \Sigma_{xz}$ by supposing that the state variable $x$ and the measurement variable $z$ are both caused by a single prior variable $x_0 \sim N(\mu_{x_0}, \Sigma_{x_0})$, via a state-update matrix $F$, and a measurement matrix $H$.

With $w \sim N(0, \Sigma_w)$ as independent process noise, we assume our state $x$ arises from $x_0$ as follows. $x = F x_0 + w$

With $v \sim N(0, \Sigma_v)$ as independent measurement error, we assume our measurement $z$ arises from $x$ (and ultimately from $x_0$) as follows. $z = Hx + v$

These two equations are enough to generate the list $\mu_x, \, \mu_z, \, \Sigma_{xx}, \, \Sigma_{zz},\, \Sigma_{xz}$ via straightforward computations. See the next section for those derivations in detail.

We will end up with the following.
$\mu_x = F\mu_{x_0}$
$\mu_z = HF\mu_{x_0}$
$\Sigma_{xx} =F\Sigma_{x_0}F^T + \Sigma_w$
$\Sigma_{xz} = \Sigma_{xx}H^T$
$\Sigma_{zz} = H\Sigma_{xx}H^T + \Sigma_v$

That’s it! Now plug those values into the Bayes inference equations and you have a Kalman Filter! $\mu_{x|z=z_0} = F\mu_{x_0}+ \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$ $\Sigma_{x|z=z_0} =\Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$
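Putting the prior construction and the Bayes inference equations together gives one full predict-and-update cycle. Here is a minimal sketch, assuming NumPy, with `Q` and `R` standing in for $\Sigma_w$ and $\Sigma_v$ (the names are illustrative, not from any library):

```python
import numpy as np

def kalman_step(mu0, P0, F, Q, H, R, z0):
    """One predict + update cycle for the model of this section:
    x = F x0 + w with w ~ N(0, Q), and z = H x + v with v ~ N(0, R)."""
    # Predict: build the priors mu_x, Sigma_xx, mu_z, Sigma_xz, Sigma_zz.
    mu_x = F @ mu0
    S_xx = F @ P0 @ F.T + Q
    mu_z = H @ mu_x
    S_xz = S_xx @ H.T
    S_zz = H @ S_xx @ H.T + R
    # Update: apply the Bayes inference equations.
    K = S_xz @ np.linalg.inv(S_zz)   # optimal Kalman gain
    mu_post = mu_x + K @ (z0 - mu_z)
    P_post = S_xx - K @ S_xz.T
    return mu_post, P_post
```

For example, with a scalar state, $F = H = 1$, prior variance 1, no process noise, and measurement variance 1, a measurement $z_0 = 2$ pulls the estimate halfway from the prior mean toward $z_0$ and halves the variance.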

A note on terminology, for comparison to the Wikipedia article on the Kalman Filter:

• $\mu_x$ is called the predicted mean.
• $\Sigma_{xx}$ is called the predicted covariance.
• $\Sigma_{zz}$ is called the innovation (or pre-fit residual) covariance.
• $\Sigma_{xz} \Sigma_{zz}^{-1} = \Sigma_{xx}H^T\Sigma_{zz}^{-1}$ is called the optimal Kalman gain.

## Deriving the Kalman Filter In Detail

In this section, I’ll show these equalities.
$\mu_x = F\mu_{x_0}$
$\mu_z = HF\mu_{x_0}$
$\Sigma_{xx} =F\Sigma_{x_0}F^T + \Sigma_w$
$\Sigma_{xz} = \Sigma_{xx}H^T$
$\Sigma_{zz} = H\Sigma_{xx}H^T + \Sigma_v$

Here are the means.
$\mu_x = E[x] =E[Fx_0 + w] = FE[x_0] + E[w] = F\mu_{x_0} + 0 = F\mu_{x_0}$
$\mu_z = E[z] = E[Hx + v] = HE[x] + E[v] = H\mu_x + 0 = HF\mu_{x_0}$

Here are the covariances and the cross covariance. It will be convenient to define the delta operator $\Delta$, where $\Delta y = y - E[y]$. For zero-mean variables like $v$, we have $\Delta v = v$. Note also that $\Delta x = Fx_0 + w - F\mu_{x_0} = F\Delta x_0 + w$.

$\Sigma_{xx} = E[\Delta x \Delta x^T]$
$= E[ (F \Delta x_0 + w)(F \Delta x_0 + w)^T]$
$= E[ F \Delta x_0 \Delta x_0^T F^T] + E[ F\Delta x_0 w^T] + E[ w \Delta x_0^T F^T] + E[ w w^T]$
Use the independence of $x_0$ and $w$ to distribute the expectation in the second and third terms.
$= F E[\Delta x_0 \Delta x_0^T] F^T + F E[\Delta x_0]E[ w^T] + E[ w]E[ \Delta x_0^T] F^T + E[ w w^T]$
$=F\Sigma_{x_0}F^T + 0 +0 + \Sigma_w$
$=F\Sigma_{x_0}F^T + \Sigma_w$

$\Sigma_{xz} = E[\Delta x \Delta z^T]$
$= E[\Delta x\Delta(Hx + v)^T]$
$= E[\Delta x (H \Delta x + v)^T]$
$= E[\Delta x \Delta x^T]H^T + E[\Delta x ]E[v^T]$
$=\Sigma_{xx} H^T + 0$
$=\Sigma_{xx} H^T$

$\Sigma_{zz} = E[\Delta z\Delta z^T]$
$= E[\Delta (Hx + v)\Delta (Hx + v)^T]$
$= E[(H\Delta x + v)(H\Delta x + v)^T]$
$= HE[\Delta x \Delta x^T]H^T + HE[\Delta x]E[v^T] +E[v]E[\Delta x^T]H^T + E[vv^T]$
$= H\Sigma_{xx} H^T + 0 + 0 + \Sigma_v$
$= H\Sigma_{xx} H^T + \Sigma_v$
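The three covariance formulas can also be checked empirically by sampling $x_0$, $w$, and $v$ and comparing sample covariances against the closed forms. A sketch assuming NumPy (the model matrices are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 200_000

# Illustrative 2-state / 1-measurement model.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
mu0 = np.zeros(2)
P0 = np.eye(2)                # Sigma_{x_0}
Q = 0.1 * np.eye(2)           # Sigma_w
R = np.array([[0.5]])         # Sigma_v

# Sample the generative model: x = F x0 + w, z = H x + v.
x0 = rng.multivariate_normal(mu0, P0, n_samples)
w = rng.multivariate_normal(np.zeros(2), Q, n_samples)
v = rng.multivariate_normal(np.zeros(1), R, n_samples)
x = x0 @ F.T + w
z = x @ H.T + v

# Compare sample covariances to the closed forms (loose tolerance
# because these are Monte Carlo estimates).
S_xx = F @ P0 @ F.T + Q
dx, dz = x - x.mean(axis=0), z - z.mean(axis=0)
assert np.allclose(dx.T @ dx / n_samples, S_xx, atol=0.05)
assert np.allclose(dx.T @ dz / n_samples, S_xx @ H.T, atol=0.05)
assert np.allclose(dz.T @ dz / n_samples, H @ S_xx @ H.T + R, atol=0.05)
```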

© 2020 Biro Inc.