Deriving the Kalman Filter

October 05, 2020
Mark Liu


Let’s use the random vector $x$ to represent an uncertain state, and the random vector $z$ to represent an uncertain measurement. Even before making any actual measurements, we should have a prior idea of the likelihoods of different values of the combined vector $\begin{bmatrix} x\\ z \end{bmatrix}$. These are subjective assessments of the following sort.

  • The value of $x$ is probably close to the known vector $a$
  • The value of $z$ will probably turn out to be close to the known vector $b$
  • The value of $z$ is probably close to $Fx$, where $F$ is a known matrix

To encode these prior subjective beliefs numerically, we can say that $\begin{bmatrix} x\\ z \end{bmatrix}$ is distributed as a Gaussian random variable.

$$\begin{bmatrix} x\\ z \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_x\\ \mu_z \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xz}\\ \Sigma_{zx} & \Sigma_{zz} \end{bmatrix}\right)$$

  • $\Sigma_{xx}$ describes how close we believe $x$ is to $\mu_x$.
  • $\Sigma_{zz}$ describes how close we believe $z$ will be to $\mu_z$.
  • $\Sigma_{xz} = \Sigma_{zx}^T$ describes how correlated we think $z$ and $x$ are.

The Kalman Filter can be viewed as a principled way to choose $\mu_x,\, \mu_z,\, \Sigma_{xx},\, \Sigma_{xz},\, \Sigma_{zx},\, \Sigma_{zz}$. There are also other ways to choose these priors, but suppose for now that we have chosen them sensibly.

Now that we have a prior $p(x,z)$, we can incorporate any measurement $z_0$ into the state by simply taking the posterior estimate $p(x|z=z_0)$. We will find in the next section that the posterior $x|z=z_0$ is distributed as a Gaussian with the following parameters.

$$\mu_{x|z=z_0} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$$
$$\Sigma_{x|z=z_0} = \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$$

I will call these the Bayes inference equations.
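To make the update concrete, here is a minimal NumPy sketch of the Bayes inference equations for a one-dimensional state and measurement. All of the numbers are illustrative choices of mine, not values from the text.

```python
import numpy as np

# Prior over the joint vector [x; z] (illustrative numbers).
mu_x = np.array([1.0])          # prior mean of the state
mu_z = np.array([2.0])          # prior mean of the measurement
S_xx = np.array([[4.0]])        # prior state covariance
S_xz = np.array([[1.5]])        # state/measurement cross-covariance
S_zz = np.array([[3.0]])        # prior measurement covariance

z0 = np.array([2.5])            # an actual measurement

# Bayes inference equations: condition the joint Gaussian on z = z0.
gain = S_xz @ np.linalg.inv(S_zz)
mu_post = mu_x + gain @ (z0 - mu_z)     # mean shifts toward the measurement
S_post = S_xx - gain @ S_xz.T           # Sigma_zx = Sigma_xz^T

print(mu_post)  # posterior mean
print(S_post)   # posterior covariance, smaller than S_xx
```

Note that the posterior covariance shrinks relative to the prior: observing $z_0$ always reduces (or at worst preserves) our uncertainty about $x$.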

Deriving the Bayes Inference Equations

In this section I’ll derive the Bayes inference equations.
$$\mu_{x|z=z_0} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$$
$$\Sigma_{x|z=z_0} = \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$$

Feel free to come back to this section later if you’re willing to take these equations on faith for now.

We have the proportionality relationship $p(x|z=z_0) = \frac{p(x, z_0)}{p(z_0)} \propto p(x,z_0)$. This means we only have to evaluate the right hand side $p(x,z_0)$ in order to know the distribution $p(x|z=z_0)$.

Remember the Gaussian density, where $K$ is a normalizing constant that we don’t care about.
$$p(x,z) = K\exp\left(-\frac{1}{2}\begin{bmatrix} x - \mu_x\\ z - \mu_z\end{bmatrix}^T\begin{bmatrix} \Sigma_{xx} & \Sigma_{xz}\\ \Sigma_{zx} & \Sigma_{zz} \end{bmatrix}^{-1}\begin{bmatrix} x - \mu_x\\ z - \mu_z \end{bmatrix}\right)$$

It will be convenient to use the inverse covariance matrix, also known as the information matrix.
$$\begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix} \equiv \begin{bmatrix} \Sigma_{xx} & \Sigma_{xz}\\ \Sigma_{zx} & \Sigma_{zz} \end{bmatrix}^{-1}$$

We can substitute the information matrix and expand. (The leftover term that depends only on $\mu_x$ and $\mu_z$ is constant in $x$ and $z$, so it gets absorbed into $K$.)
$$p(x,z) = K\exp\left(-\frac{1}{2}\begin{bmatrix} x\\ z \end{bmatrix}^T\begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix}\begin{bmatrix} x\\ z \end{bmatrix} + \begin{bmatrix} x\\ z \end{bmatrix}^T \begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix}\begin{bmatrix} \mu_x\\ \mu_z \end{bmatrix}\right)$$

Then substitute $z = z_0$ and expand more. We can collect any terms that are not multiplied by $x$ into a constant $C$.
$$p(x,z_0) = K\exp\left(-\frac{1}{2}x^T\Lambda_{xx}x - x^T\Lambda_{xz}z_0 + x^T \Lambda_{xx} \mu_x + x^T\Lambda_{xz}\mu_z + C\right)$$

The $C$ and $K$ both drop out as scaling constants.
$$p(x,z_0) \propto \exp\left(-\frac{1}{2}x^T\Lambda_{xx}x - x^T\Lambda_{xz}z_0 + x^T \Lambda_{xx} \mu_x + x^T\Lambda_{xz}\mu_z\right)$$
$$p(x,z_0) \propto \exp\left(-\frac{1}{2}x^T\Lambda_{xx}x + x^T(\Lambda_{xx} \mu_x - \Lambda_{xz}(z_0 - \mu_z))\right)$$

Complete the square by first rewriting $\Lambda_{xx} \mu_x - \Lambda_{xz}(z_0 - \mu_z) \rightarrow \Lambda_{xx} (\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z))$.
$$p(x,z_0) \propto \exp\left(-\frac{1}{2}x^T\Lambda_{xx}x + x^T\Lambda_{xx} (\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z))\right)$$
$$p(x,z_0) \propto \exp\left(-\frac{1}{2} \left(x - (\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z))\right)^T\Lambda_{xx}\left(x - (\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z))\right)\right)$$

Note that this is the probability density of a Gaussian with mean $\mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z)$ and covariance $\Lambda_{xx}^{-1}$.
$$\mu_{x|z=z_0} = \mu_x - \Lambda_{xx}^{-1}\Lambda_{xz}(z_0 - \mu_z)$$
$$\Sigma_{x|z=z_0} = \Lambda_{xx}^{-1}$$

This formula is written in terms of the information matrix, but in many cases it is more convenient to write it in terms of the covariance matrix. To accomplish this, we can use the block-matrix inversion formula, where $\Sigma/\Sigma_{zz}$ denotes the Schur complement $\Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$.

$$\begin{bmatrix} \Lambda_{xx} & \Lambda_{xz}\\ \Lambda_{zx} & \Lambda_{zz} \end{bmatrix} = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xz}\\ \Sigma_{zx} & \Sigma_{zz} \end{bmatrix}^{-1} = \begin{bmatrix} (\Sigma/\Sigma_{zz})^{-1} & -(\Sigma/\Sigma_{zz})^{-1}\Sigma_{xz}\Sigma_{zz}^{-1} \\ -\Sigma_{zz}^{-1}\Sigma_{zx}(\Sigma/\Sigma_{zz})^{-1} & \Sigma_{zz}^{-1} + \Sigma_{zz}^{-1}\Sigma_{zx} (\Sigma/\Sigma_{zz})^{-1}\Sigma_{xz}\Sigma_{zz}^{-1}\end{bmatrix}$$

We see that $\Lambda_{xx}^{-1} = \Sigma/\Sigma_{zz}$ and $-\Lambda_{xx}^{-1} \Lambda_{xz} = \Sigma_{xz}\Sigma_{zz}^{-1}$. Therefore we can write the distribution of $x|z=z_0$ in terms of the covariance matrix.
$$\mu_{x|z=z_0} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$$
$$\Sigma_{x|z=z_0} = \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$$
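As a quick sanity check (my own, not part of the original derivation), the two identities $\Lambda_{xx}^{-1} = \Sigma/\Sigma_{zz}$ and $-\Lambda_{xx}^{-1}\Lambda_{xz} = \Sigma_{xz}\Sigma_{zz}^{-1}$ can be verified numerically on a random positive-definite covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive-definite joint covariance
# (2 state dimensions, 2 measurement dimensions).
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)
S_xx, S_xz = Sigma[:2, :2], Sigma[:2, 2:]
S_zx, S_zz = Sigma[2:, :2], Sigma[2:, 2:]

# Corresponding information-matrix blocks.
Lam = np.linalg.inv(Sigma)
L_xx, L_xz = Lam[:2, :2], Lam[:2, 2:]

# Schur complement Sigma/Sigma_zz.
schur = S_xx - S_xz @ np.linalg.inv(S_zz) @ S_zx

# The two identities used in the text.
assert np.allclose(np.linalg.inv(L_xx), schur)
assert np.allclose(-np.linalg.inv(L_xx) @ L_xz, S_xz @ np.linalg.inv(S_zz))
print("identities hold")
```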

Deriving the Kalman Filter

In the first section, I mentioned that the Kalman Filter can be seen as a principled way to establish the priors $\mu_x, \, \mu_z, \, \Sigma_{xx}, \, \Sigma_{zz},\, \Sigma_{xz}$.

Remember we wanted these priors so that, given an actual measurement $z_0$, we could apply the Bayes inference equations.
$$\mu_{x|z=z_0} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$$
$$\Sigma_{x|z=z_0} = \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$$

The Kalman Filter sets up $\mu_x, \, \mu_z, \, \Sigma_{xx}, \, \Sigma_{zz},\, \Sigma_{xz}$ by supposing that the state variable $x$ and the measurement variable $z$ are both caused by a single prior variable $x_0 \sim N(\mu_{x_0}, \Sigma_{x_0})$, via a state-update matrix $F$ and a measurement matrix $H$.

With $w \sim N(0, \Sigma_w)$ as independent process noise, we assume our state $x$ arises from $x_0$ as follows.
$$x = F x_0 + w$$

With $v \sim N(0, \Sigma_v)$ as independent measurement error, we assume our measurement $z$ arises from $x$ (and ultimately from $x_0$) as follows.
$$z = Hx + v$$

These two equations are enough to generate the list $\mu_x, \, \mu_z, \, \Sigma_{xx}, \, \Sigma_{zz},\, \Sigma_{xz}$ via straightforward computations. See the next section for those derivations in detail.

We will end up with the following.
$$\mu_x = F\mu_{x_0}$$
$$\mu_z = HF\mu_{x_0}$$
$$\Sigma_{xx} = F\Sigma_{x_0}F^T + \Sigma_w$$
$$\Sigma_{xz} = \Sigma_{xx}H^T$$
$$\Sigma_{zz} = H\Sigma_{xx}H^T + \Sigma_v$$

That’s it! Now plug those values into the Bayes update rule and you have a Kalman Filter!
$$\mu_{x|z=z_0} = F\mu_{x_0} + \Sigma_{xz}\Sigma_{zz}^{-1}(z_0 - \mu_z)$$
$$\Sigma_{x|z=z_0} = \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$$
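Putting the pieces together, here is one possible sketch of a single predict-and-update cycle in NumPy. The function and argument names are my own, and the position/velocity model at the bottom is an invented example, not something from the text.

```python
import numpy as np

def kalman_step(mu_x0, S_x0, F, S_w, H, S_v, z0):
    """One Kalman Filter cycle: build the priors from the generative
    model, then apply the Bayes inference equations to measurement z0."""
    # Priors induced by x = F x0 + w and z = H x + v.
    mu_x = F @ mu_x0                     # predicted mean
    mu_z = H @ mu_x                      # predicted measurement
    S_xx = F @ S_x0 @ F.T + S_w          # predicted covariance
    S_xz = S_xx @ H.T                    # cross-covariance
    S_zz = H @ S_xx @ H.T + S_v          # innovation covariance
    # Bayes inference equations.
    K = S_xz @ np.linalg.inv(S_zz)       # optimal Kalman gain
    mu_post = mu_x + K @ (z0 - mu_z)
    S_post = S_xx - K @ S_xz.T
    return mu_post, S_post

# Example: 1-D position/velocity state with a position-only sensor.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity dynamics
H = np.array([[1.0, 0.0]])               # we observe position only
S_w = 0.01 * np.eye(2)                   # process noise
S_v = np.array([[0.25]])                 # measurement noise
mu, S = np.array([0.0, 1.0]), np.eye(2)  # initial belief
mu, S = kalman_step(mu, S, F, S_w, H, S_v, np.array([1.1]))
```

In practice this step is applied repeatedly, feeding each posterior $(\mu, \Sigma)$ back in as the next cycle’s prior $(\mu_{x_0}, \Sigma_{x_0})$.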

A note on terminology, for comparison with the Wikipedia article on the Kalman Filter:

  • $\mu_x$ is called the predicted mean
  • $\Sigma_{xx}$ is called the predicted covariance
  • $\Sigma_{zz}$ is called the innovation (or pre-fit residual) covariance
  • $\Sigma_{xz} \Sigma_{zz}^{-1} = \Sigma_{xx}H^T\Sigma_{zz}^{-1}$ is called the optimal Kalman gain

Deriving the Kalman Filter In Detail

In this section, I’ll show these equalities.
$$\mu_x = F\mu_{x_0}$$
$$\mu_z = HF\mu_{x_0}$$
$$\Sigma_{xx} = F\Sigma_{x_0}F^T + \Sigma_w$$
$$\Sigma_{xz} = \Sigma_{xx}H^T$$
$$\Sigma_{zz} = H\Sigma_{xx}H^T + \Sigma_v$$

Here are the means.
$$\mu_x = E[x] = E[Fx_0 + w] = FE[x_0] + E[w] = F\mu_{x_0} + 0 = F\mu_{x_0}$$
$$\mu_z = E[z] = E[Hx + v] = HE[x] + E[v] = H\mu_x + 0 = HF\mu_{x_0}$$

Here are the covariances and the cross covariance. It will be convenient to define the delta operator $\Delta$, which means $\Delta y = y - E[y]$. Also, for zero-mean variables like $v$, we have $\Delta v = v$.

$$\Sigma_{xx} = E[\Delta x \Delta x^T]$$
$$= E[(F \Delta x_0 + w)(F \Delta x_0 + w)^T]$$
$$= E[F \Delta x_0 \Delta x_0^T F^T] + E[F\Delta x_0 w^T] + E[w \Delta x_0^T F^T] + E[w w^T]$$
Use independence to distribute the expectation in the second and third terms.
$$= F E[\Delta x_0 \Delta x_0^T] F^T + E[F\Delta x_0]E[w^T] + E[w]E[\Delta x_0^T F^T] + E[w w^T]$$
$$= F\Sigma_{x_0}F^T + 0 + 0 + \Sigma_w$$
$$= F\Sigma_{x_0}F^T + \Sigma_w$$

$$\Sigma_{xz} = E[\Delta x \Delta z^T]$$
$$= E[\Delta x\, \Delta(Hx + v)^T]$$
$$= E[\Delta x (H \Delta x + v)^T]$$
$$= E[\Delta x \Delta x^T]H^T + E[\Delta x]E[v^T]$$
$$= \Sigma_{xx} H^T + 0$$
$$= \Sigma_{xx} H^T$$

$$\Sigma_{zz} = E[\Delta z \Delta z^T]$$
$$= E[\Delta(Hx + v)\, \Delta(Hx + v)^T]$$
$$= E[(H\Delta x + v)(H\Delta x + v)^T]$$
$$= HE[\Delta x \Delta x^T]H^T + HE[\Delta x]E[v^T] + E[v]E[\Delta x^T]H^T + E[vv^T]$$
$$= H\Sigma_{xx} H^T + 0 + 0 + \Sigma_v$$
$$= H\Sigma_{xx} H^T + \Sigma_v$$
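These moment formulas can also be checked empirically (a sanity check of my own, with made-up model matrices): sample the generative model $x = Fx_0 + w$, $z = Hx + v$ many times and compare the sample moments against the formulas above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000  # number of Monte Carlo samples

# Illustrative model parameters.
F = np.array([[1.0, 0.5], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
mu_x0 = np.array([1.0, -1.0])
S_x0 = np.array([[1.0, 0.2], [0.2, 0.5]])
S_w = 0.1 * np.eye(2)
S_v = np.array([[0.3]])

# Sample the generative model x = F x0 + w, z = H x + v.
x0 = rng.multivariate_normal(mu_x0, S_x0, size=n)
w = rng.multivariate_normal(np.zeros(2), S_w, size=n)
v = rng.multivariate_normal(np.zeros(1), S_v, size=n)
x = x0 @ F.T + w
z = x @ H.T + v

# The closed-form moments derived in the text.
S_xx = F @ S_x0 @ F.T + S_w
S_xz = S_xx @ H.T
S_zz = H @ S_xx @ H.T + S_v

# Compare against the empirical moments.
joint = np.cov(np.hstack([x, z]).T)  # 3x3 sample covariance of [x; z]
assert np.allclose(x.mean(axis=0), F @ mu_x0, atol=0.05)
assert np.allclose(joint[:2, :2], S_xx, atol=0.05)
assert np.allclose(joint[:2, 2:], S_xz, atol=0.05)
assert np.allclose(joint[2:, 2:], S_zz, atol=0.05)
print("moment formulas confirmed by sampling")
```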

© 2020 Biro Inc.