Maximum Likelihood Estimate:

Decision boundary between two Gaussians

Detailed proof

The goal of this demonstration is to show that the decision boundary between two Gaussian models depends on the parameters of the distributions. In particular, we will establish that this boundary takes several forms: a straight line when the covariance matrices of the distributions are identical, and a conic such as a circle or an ellipse when these matrices differ. This analysis will help us better understand how Bayesian classification discriminates between classes based on the parameters of Gaussian distributions.


Reminder

The probability density function of a D-dimensional Gaussian model is:

\[ \mathcal{N}(x\mid\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{D}{2}}} \frac{1}{|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \]

In the context of Machine Learning, we have random variables \(X_1,X_2,...,X_n\) distributed i.i.d. according to a Gaussian mixture (one Gaussian per class), and an i.i.d. sample \(x_1,x_2,...,x_n\) drawn from them. The aim is to assign each sample to a class: a model is trained on labelled data to learn how the classes differ, and it then predicts the most plausible class for each \(x\) from its features.

Here, we consider two classes, two Gaussians—in other words, their probability densities are:

\[ p_k(x) = \mathcal{N}(x \mid \mu_k, \Sigma_k), \quad k = 1, 2. \]

The decision boundary is the set of points where we cannot decide between the two models. Mathematically, in the case of two Gaussian models, this corresponds to the equality of their probability density functions. If x lies on the decision boundary, we have:

\begin{align*} &\frac{1}{(2\pi)^{\frac{D}{2}}} \frac{1}{|\Sigma_1|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\right) = \frac{1}{(2\pi)^{\frac{D}{2}}} \frac{1}{|\Sigma_2|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\right) \end{align*}
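To make this concrete, here is a minimal Python sketch (the means, covariances, and the helper classify are illustrative assumptions, not part of the derivation) that evaluates both class densities with scipy.stats.multivariate_normal and assigns a point to the class with the larger density; points where the two densities coincide form exactly the decision boundary:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative class parameters (assumptions for the example only)
mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, Sigma2 = np.array([3.0, 0.0]), np.array([[2.0, 0.0], [0.0, 2.0]])

p1 = multivariate_normal(mean=mu1, cov=Sigma1)
p2 = multivariate_normal(mean=mu2, cov=Sigma2)

def classify(x):
    """Assign x to the class whose density is larger at x."""
    return 1 if p1.pdf(x) >= p2.pdf(x) else 2

x = np.array([1.0, 0.5])
print(classify(x), p1.pdf(x), p2.pdf(x))
```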

Simplification of the equation

In this section, we'll simplify the equation to make it more workable in the next parts, where we'll tackle the different cases.

\begin{align*} &\frac{1}{(2\pi)^{\frac{D}{2}}} \frac{1}{|\Sigma_1|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\right) \\ &= \frac{1}{(2\pi)^{\frac{D}{2}}} \frac{1}{|\Sigma_2|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\right) \end{align*}

\[ \iff \]

\begin{align*} &\frac{1}{|\Sigma_1|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\right) \\ &= \frac{1}{|\Sigma_2|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\right) \end{align*}

\[ \overset{(1)}{\iff} \]

\begin{align*} &\log\left(\frac{1}{|\Sigma_1|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\right)\right) \\ &= \log\left(\frac{1}{|\Sigma_2|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\right)\right) \end{align*}

\[ \iff \]

\begin{align*} &\log\left(\frac{1}{|\Sigma_1|^{1/2}}\right) -\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) \\ &= \log\left(\frac{1}{|\Sigma_2|^{1/2}}\right) -\frac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) \end{align*}

\[ \iff \]

\begin{align*} &\log(|\Sigma_1|) + (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) \\ &= \log(|\Sigma_2|) + (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) \end{align*}

\[ \iff \]

\begin{align*} &\log(\frac{|\Sigma_1|}{|\Sigma_2|}) = (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) - (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) \tag{2} \end{align*}

(1) is possible because \( \exp : \mathbb{R} \to \mathbb{R}_+^* \) takes only positive values and \( |\Sigma_i| > 0 \) (the covariance matrices are positive definite), so \( \frac{1}{|\Sigma_i|^{1/2}} > 0 \). Additionally, \( \log : \mathbb{R}_+^* \to \mathbb{R} \) is strictly increasing, so applying it to both sides preserves the equality.
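As a quick numerical sanity check of this equivalence (the parameter values below are illustrative assumptions), one can verify that the difference of log-densities equals half the gap between the two sides of \((2)\), so both sides coincide exactly when the densities are equal:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, Sigma1 = np.array([0.0, 0.0]), np.diag([1.0, 1.0])
mu2, Sigma2 = np.array([2.0, 1.0]), np.diag([2.0, 0.5])

def quad(x, mu, Sigma):
    """Quadratic form (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ np.linalg.inv(Sigma) @ d

x = np.array([1.0, 0.3])
lhs = np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma2))
rhs = quad(x, mu2, Sigma2) - quad(x, mu1, Sigma1)

# log p1(x) - log p2(x) = (rhs - lhs) / 2, so lhs = rhs exactly when p1(x) = p2(x)
log_ratio = multivariate_normal(mu1, Sigma1).logpdf(x) - multivariate_normal(mu2, Sigma2).logpdf(x)
print(np.isclose(log_ratio, (rhs - lhs) / 2.0))  # True
```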


Disjunction of cases

1. Simple case \( \Sigma_i = \sigma_i \cdot I \):

This is equivalent to saying that, within class \(i\), the components of \(x\) are independent with the same variance \(\sigma_i\), so their pairwise covariances are zero. In this case, it is clear that:

\[ |\Sigma_i| = \sigma_i^D, \ \Sigma_i^{-1} = \frac{1}{\sigma_i} I \]
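A one-line numerical confirmation of these two identities (with an arbitrary illustrative \(\sigma\)) might look like this:

```python
import numpy as np

sigma, D = 1.7, 2                      # illustrative variance and dimension
Sigma = sigma * np.eye(D)              # Sigma_i = sigma_i * I
print(np.isclose(np.linalg.det(Sigma), sigma**D))            # |Sigma_i| = sigma_i^D
print(np.allclose(np.linalg.inv(Sigma), np.eye(D) / sigma))  # Sigma_i^{-1} = I / sigma_i
```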

Hence (2) can be simplified :

\[ (2) \iff \]

\[ D(\log(\frac{\sigma_1}{\sigma_2})) = \frac{1}{\sigma_2}(x-\mu_2)^T(x-\mu_2) - \frac{1}{\sigma_1}(x-\mu_1)^T(x-\mu_1) \]

\[ \iff \]

\[ \frac{1}{\sigma_2}\|x-\mu_2\|^2 - \frac{1}{\sigma_1}\|x-\mu_1\|^2 - D(\log(\frac{\sigma_1}{\sigma_2})) = 0 \]

Which is a conic equation.

A conic equation is an algebraic equation that represents a conic section: the curve obtained by intersecting a plane with a cone of revolution. Depending on the angle of the plane, this intersection is an ellipse (possibly a circle), a parabola, or a hyperbola.

Mathematically, the equation of a conic is a quadratic polynomial in two variables \(u\) and \(v\):

\[ A u^2 + B u v + C v^2 + D u + E v + F = 0 \]

In our context, we're talking about the intersection of two Gaussian distributions, in which we're looking for the decision boundary where the probability densities are equal : \[ p_1(x) = p_2(x) \]

And as we saw at the beginning of this section, by taking the logarithm, this gives a quadratic equation in x, which can be a conic.

We can see this by expanding the quadratic function we saw above.

\[ \frac{1}{\sigma_2}\|x-\mu_2\|^2 - \frac{1}{\sigma_1}\|x-\mu_1\|^2 - D(\log(\frac{\sigma_1}{\sigma_2})) = 0 \]

By expanding the norms :

\[ \|x - \mu_1\|^2 =(x-\mu_1)^T(x-\mu_1) =x^T x - 2\mu_1^T x + \mu_1^T \mu_1 \]

\[ \|x - \mu_2\|^2 =(x-\mu_2)^T(x-\mu_2)= x^T x - 2\mu_2^T x + \mu_2^T \mu_2 \]

\[ \frac{1}{\sigma_2} (x^T x - 2\mu_2^T x + \mu_2^T \mu_2) - \frac{1}{\sigma_1} (x^T x - 2\mu_1^T x + \mu_1^T \mu_1) - D \log\left(\frac{\sigma_1}{\sigma_2}\right) = 0 \]

\[ \frac{1}{\sigma_2} x^T x - \frac{2}{\sigma_2} \mu_2^T x + \frac{\mu_2^T \mu_2}{\sigma_2} - \frac{1}{\sigma_1} x^T x + \frac{2}{\sigma_1} \mu_1^T x - \frac{\mu_1^T \mu_1}{\sigma_1} - D \log\left(\frac{\sigma_1}{\sigma_2}\right) = 0 \]

And we group terms of the same degree to obtain a conic function:

\[ \left(\frac{1}{\sigma_2} - \frac{1}{\sigma_1} \right) x^T x - 2 \left(\frac{\mu_2}{\sigma_2} - \frac{\mu_1}{\sigma_1} \right)^T x + \left(\frac{\mu_2^T \mu_2}{\sigma_2} - \frac{\mu_1^T \mu_1}{\sigma_1} - D \log \left(\frac{\sigma_1}{\sigma_2}\right) \right) = 0 \]

Here we consider \( D = 2 \): \( x \in \mathbb{R}^2 \), with \( x = \begin{bmatrix} x[1] \\ x[2] \end{bmatrix} \), and we identify \( u = x[1] \) and \( v = x[2] \).

The general equation of a conic in \(\mathbb{R}^2\) is:

\[ A u^2 + B u v + C v^2 + D u + E v+ F = 0 \]

And so, comparing term by term, we get:

Quadratic terms: \[ A = C = \frac{1}{\sigma_2} - \frac{1}{\sigma_1} \]

There is no term \(uv\), so \(B = 0\quad\)

Linear terms: \[ D = -2 \left( \frac{\mu_{2,x}}{\sigma_2} - \frac{\mu_{1,x}}{\sigma_1} \right), \quad E = -2 \left( \frac{\mu_{2,y}}{\sigma_2} - \frac{\mu_{1,y}}{\sigma_1} \right) \] where \(\mu_{1,x}\) and \(\mu_{1,y}\) are the coordinates of \(\mu_1\), and the same for \(\mu_2\).

Constant term: \[ F = \frac{\mu_2^T \mu_2}{\sigma_2} - \frac{\mu_1^T \mu_1}{\sigma_1} - 2 \log \left( \frac{\sigma_1}{\sigma_2} \right) \] where the factor \(2\) is the dimension \(D = 2\), not to be confused with the linear coefficient \(D\) above.

This confirms the conic nature of the decision boundary in this case.
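For reference, here is a small sketch (illustrative parameters; the names D_lin and E_lin stand for the conic's linear coefficients, to avoid clashing with the dimension \(D\)) that computes these six coefficients for given \(\mu_1, \mu_2, \sigma_1, \sigma_2\):

```python
import numpy as np

# Illustrative parameters (assumptions for the example only)
mu1, sigma1 = np.array([0.0, 0.0]), 1.0
mu2, sigma2 = np.array([3.0, 1.0]), 2.0
dim = 2  # the dimension D of the ambient space

A = C = 1.0 / sigma2 - 1.0 / sigma1                  # quadratic coefficients
B = 0.0                                              # no cross term u*v
D_lin = -2.0 * (mu2[0] / sigma2 - mu1[0] / sigma1)   # linear coefficient of u = x[1]
E_lin = -2.0 * (mu2[1] / sigma2 - mu1[1] / sigma1)   # linear coefficient of v = x[2]
F = (mu2 @ mu2) / sigma2 - (mu1 @ mu1) / sigma1 - dim * np.log(sigma1 / sigma2)

print(A, B, C, D_lin, E_lin, F)
```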

1.1 The various cases depending on \( \sigma_1\) and \(\sigma_2 \)

If \(\sigma_1 = \sigma_2\)

Then, since \(A = C = \frac{1}{\sigma_2} - \frac{1}{\sigma_1} = 0\) and \(D\log\left(\frac{\sigma_1}{\sigma_2}\right) = 0\), the boundary equation reduces to:

\[ \frac{1}{\sigma_2}\|x-\mu_2\|^2 - \frac{1}{\sigma_1}\|x-\mu_1\|^2 = 0 \]

\[ \iff \]

\[ \frac{1}{\sigma_2}\|x-\mu_2\|^2 = \frac{1}{\sigma_1}\|x-\mu_1\|^2 \]

Since \(\sigma_1 = \sigma_2\), denoting the decision boundary by \(B\):

\[ x\in B \iff \|x-\mu_2\| = \|x-\mu_1\| \]

\[ \Longrightarrow x \text{ lies on the perpendicular bisector of the segment } [\mu_1,\mu_2] \]


Therefore, in the case \( \sigma_1 = \sigma_2 \), we have \( A = C = 0 \): the remaining equation is linear in \(x\), which means the decision boundary is a straight line.
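The following sketch (illustrative values) checks this numerically: any point whose first coordinate is the midpoint of the two means, i.e. a point of the perpendicular bisector in this frame, yields equal densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2, sigma = np.array([0.0, 0.0]), np.array([4.0, 0.0]), 1.5  # illustrative
p1 = multivariate_normal(mu1, sigma * np.eye(2))
p2 = multivariate_normal(mu2, sigma * np.eye(2))

# A point on the perpendicular bisector of [mu1, mu2]: its first coordinate is
# the midpoint of the two means, its second coordinate is arbitrary.
x = np.array([2.0, 7.3])
print(np.isclose(p1.pdf(x), p2.pdf(x)))  # True: equal densities on the bisector
```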


If \(\sigma_1 \neq \sigma_2 \)

In this case, the components of \(x\) remain independent within each class, but the variance differs between the two classes.

The Gaussian distribution equation becomes,

\[ \mathcal{N}(x\mid\mu, \sigma \cdot I) = \frac{1}{(2\pi)^{D/2}} \cdot \frac{1}{\sigma^{D/2}} e^{-\frac{1}{2\sigma}(x-\mu)^T(x-\mu)} \]

\[ = \frac{1}{(2\pi)^{2/2}} \cdot \frac{1}{\sigma^{2/2}} e^{-\frac{1}{2\sigma}(x-\mu)^T(x-\mu)} \]

\[ = \frac{1}{2\pi} \cdot \frac{1}{\sigma} e^{-\frac{1}{2\sigma}(x-\mu)^T(x-\mu)} \]

We can then consider,

\[ \frac{1}{2\pi} \cdot \frac{1}{\sigma_1} e^{-\frac{1}{2\sigma_1}(x-\mu_1)^T(x-\mu_1)} = \frac{1}{2\pi} \cdot \frac{1}{\sigma_2} e^{-\frac{1}{2\sigma_2}(x-\mu_2)^T(x-\mu_2)} \]

Taking the logarithm, multiplying through by 2 and rearranging the terms,

\[ \frac{1}{\sigma_2}(x-\mu_2)^T(x-\mu_2)-\frac{1}{\sigma_1}(x-\mu_1)^T(x-\mu_1) + 2\log(\sigma_2)-2\log(\sigma_1) = 0 \]

The result is

\[ \frac{1}{\sigma_2}\|x-\mu_2\|^2 - \frac{1}{\sigma_1}\|x-\mu_1\|^2 + 2\log\left(\frac{\sigma_2}{\sigma_1}\right) = 0 \]

Without loss of generality, we can place ourselves in a coordinate system whose first axis passes through the two means, so that (with \(x_A\) denoting a point of the boundary on this axis),

\[ \mu_1 = \begin{bmatrix}m_1\\ 0\end{bmatrix}, \quad \mu_2 = \begin{bmatrix}m_2\\ 0\end{bmatrix}, \quad x_A = \begin{bmatrix}x\\ 0\end{bmatrix} \]

We then have,

\[ \frac{1}{\sigma_2}(x-m_2)^2 - \frac{1}{\sigma_1}(x-m_1)^2 + 2\log\left(\frac{\sigma_2}{\sigma_1}\right) = 0 \]

This results in

\[ \left(\frac{1}{\sigma_2}-\frac{1}{\sigma_1}\right)x^2 - 2 \left(\frac{m_2}{\sigma_2}-\frac{m_1}{\sigma_1}\right)x + \left(\frac{m_2^2}{\sigma_2}-\frac{m_1^2}{\sigma_1}\right) + 2\log\left(\frac{\sigma_2}{\sigma_1}\right) = 0 \]

Examining this, we notice that it's a polynomial equation of degree 2 in one dimension.

But in two dimensions, the equation becomes :

\[ \left(\frac{1}{\sigma_2}-\frac{1}{\sigma_1}\right)x[1]^2 +\left(\frac{1}{\sigma_2}-\frac{1}{\sigma_1}\right)x[2]^2 - 2 \left(\frac{m_2}{\sigma_2}-\frac{m_1}{\sigma_1}\right)x[1] + \left(\frac{m_2^2}{\sigma_2}-\frac{m_1^2}{\sigma_1}\right) + 2\log\left(\frac{\sigma_2}{\sigma_1}\right) = 0 \]

(there is no linear term in \(x[2]\), since both means have a zero second coordinate in this frame)



After further analysis, we see that \( A = C \neq 0 \) and \( B = 0 \), which means the boundary is a circle.

The intersections with the axis through the means are found by setting \( x[2] = 0 \) and solving the resulting quadratic in \( x[1] \), namely \( \alpha\,x[1]^2 + \beta\,x[1] + \gamma = 0 \) with \( \alpha = \frac{1}{\sigma_2}-\frac{1}{\sigma_1} \), \( \beta = -2\left(\frac{m_2}{\sigma_2}-\frac{m_1}{\sigma_1}\right) \), \( \gamma = \frac{m_2^2}{\sigma_2}-\frac{m_1^2}{\sigma_1}+2\log\left(\frac{\sigma_2}{\sigma_1}\right) \):

\[ x_A^{\pm}[1] = \frac{-\beta \pm \sqrt{\Delta}}{2\alpha}, \qquad \Delta = \beta^2 - 4\alpha\gamma \]

Logically, \( x_A^{\pm}[2] = 0 \).

This also allows us to compute the center of the circle,

\[ c = \begin{bmatrix}\frac{x_A^+[1] + x_A^-[1]}{2}\\ 0\end{bmatrix} \]

and its radius,

\[ r = \frac{|x_A^+[1] - x_A^-[1]|}{2} \]

To find the intersections along the vertical direction, we solve the equation for \( x_B[2] \) after setting \( x_B[1] = c_x = \frac{x_A^+[1] + x_A^-[1]}{2} \) (in this isotropic case, \(\sigma_{i1} = \sigma_{i2} = \sigma_i\)):

\[ \log(\sigma_{11}\cdot\sigma_{12}) - \log(\sigma_{21}\cdot\sigma_{22}) = \left[\frac{(x_B[1]-m_1)^2}{\sigma_{11}} + \frac{x_B[2]^2}{\sigma_{12}}\right] - \left[\frac{(x_B[1]-m_2)^2}{\sigma_{21}} + \frac{x_B[2]^2}{\sigma_{22}}\right] \]

Solving this for \( x_B[2]^2 \) and taking the square root gives the two intersection points

\[ x_B^\pm[2] = \pm\sqrt{x_B[2]^2} \]
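A short sketch (illustrative parameters) tracing these steps: solve the quadratic on the axis through the means, then deduce the center and the radius of the circular boundary:

```python
import numpy as np

# Illustrative parameters in the frame where both means lie on the first axis
m1, sigma1 = 0.0, 1.0
m2, sigma2 = 3.0, 2.0

alpha = 1.0 / sigma2 - 1.0 / sigma1                       # quadratic coefficient
beta = -2.0 * (m2 / sigma2 - m1 / sigma1)                 # linear coefficient
gamma = m2**2 / sigma2 - m1**2 / sigma1 + 2.0 * np.log(sigma2 / sigma1)

delta = beta**2 - 4.0 * alpha * gamma
xA_plus = (-beta + np.sqrt(delta)) / (2.0 * alpha)        # intersections with the axis
xA_minus = (-beta - np.sqrt(delta)) / (2.0 * alpha)

center = np.array([(xA_plus + xA_minus) / 2.0, 0.0])      # midpoint of the two roots
radius = abs(xA_plus - xA_minus) / 2.0
print(xA_plus, xA_minus, center, radius)
```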

The intersections with the axes



2. Advanced case \( \Sigma_i \neq \sigma_i \cdot I \)

As in the previous section, we consider a coordinate system where \[\mu_1 = \begin{bmatrix}m_1\\0\end{bmatrix} \quad \text{and} \quad \mu_2 = \begin{bmatrix}m_2\\0\end{bmatrix}\] Let's return to equation \((2)\) of the decision boundary: \[\log\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right) = (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) - (x-\mu_1)^T \Sigma_1^{-1}(x-\mu_1)\]

Let's assume the covariance matrices are diagonal, \(\Sigma_i = \begin{bmatrix}\sigma_{i1} & 0\\ 0 & \sigma_{i2}\end{bmatrix}\), and write \(x= \begin{bmatrix}x[1] \\ x[2]\end{bmatrix} \). Each quadratic term then expands as follows: \[(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) = \frac{(x[1] - m_i)^2}{\sigma_{i1}} + \frac{x[2]^2}{\sigma_{i2}}\]

By inserting into the previous equation, we obtain : \[ \log\left(\frac{\sigma_{11} \sigma_{12}}{\sigma_{21} \sigma_{22}}\right) = \frac{(x[1] - m_2)^2}{\sigma_{21}} + \frac{x[2]^2}{\sigma_{22}} - \frac{(x[1] - m_1)^2}{\sigma_{11}} - \frac{x[2]^2}{\sigma_{12}}\]

Now let's rearrange the terms, \[ \left( \frac{1}{\sigma_{21}} - \frac{1}{\sigma_{11}} \right) x[1]^2 + \left( \frac{1}{\sigma_{22}} - \frac{1}{\sigma_{12}} \right) x[2]^2 \] \[+ \left( \frac{-2m_2}{\sigma_{21}} + \frac{2m_1}{\sigma_{11}} \right) x[1] + \left( \frac{m_2^2}{\sigma_{21}} - \frac{m_1^2}{\sigma_{11}} - \log\left( \frac{\sigma_{11} \sigma_{12}}{\sigma_{21} \sigma_{22}} \right) \right) = 0 \]

The quadratic equation we obtain is that of a conic. We then notice that the coefficients of this equation, denoted \((A, B, C, D, E, F)\), and consequently the shape of the decision boundary, depend directly on the variances \(\sigma_{ij}\).

The precise shape of the conic (ellipse, parabola, or hyperbola) depends on the sign of the discriminant \(\Delta = B^2 - 4AC\); since \(B = 0\) here, \(\Delta = -4AC\) and the sign of \(A \cdot C\) decides.

  • If \(\Delta = 0\), the equation represents a parabola (with \(A = 0\) or \(C = 0\)).
  • If \(\Delta > 0\), it describes a hyperbola (with \(A \cdot C < 0\)).
  • If \(\Delta < 0\), it describes an ellipse (with \(A \cdot C > 0\)).

For example, obtaining an ellipse is equivalent to fulfilling the following condition:

\[ \left( \frac{1}{\sigma_{21}} - \frac{1}{\sigma_{11}} \right) \cdot \left( \frac{1}{\sigma_{22}} - \frac{1}{\sigma_{12}} \right) > 0 \]

This corresponds to a situation where, going from one class to the other, the variances along both axes change in the same direction (both increase or both decrease).
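A small sketch (illustrative variances, with the diagonal covariances assumed above) that classifies the boundary's conic type from the signs of \(A\) and \(C\):

```python
import numpy as np

# Illustrative diagonal variances: Sigma_1 = diag(s11, s12), Sigma_2 = diag(s21, s22)
s11, s12 = 1.0, 2.0
s21, s22 = 3.0, 0.5

A = 1.0 / s21 - 1.0 / s11   # coefficient of x[1]^2
C = 1.0 / s22 - 1.0 / s12   # coefficient of x[2]^2

if np.isclose(A * C, 0.0):
    kind = "parabola (A or C vanishes)"
elif A * C > 0:
    kind = "ellipse"
else:
    kind = "hyperbola"
print(A, C, kind)
```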

We can then determine the geometric parameters of the ellipse as follows (see the sketch after this list):

  • By setting \(x[2] = 0\), we solve the equation for \(x[1]\): this gives the abscissae \(l_m\) and \(l_p\) of the horizontal extremities of the ellipse.
  • The center of the ellipse is then located at \(c = \begin{bmatrix}\frac{l_m + l_p}{2}\\0\end{bmatrix}\).
  • By setting \(x[1] = x_c\) with \(x_c = \frac{l_m + l_p}{2}\), we solve the equation for \(x[2]\): this gives the heights \(h_m\) and \(h_p\).
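Here is a sketch of these three steps (the variances are illustrative and chosen so that \(A \cdot C > 0\), i.e. the boundary really is an ellipse):

```python
import numpy as np

# Illustrative parameters, chosen so that A * C > 0 (ellipse case)
m1, s11, s12 = 0.0, 1.0, 1.0
m2, s21, s22 = 3.0, 2.0, 3.0

A = 1.0 / s21 - 1.0 / s11
C = 1.0 / s22 - 1.0 / s12
D_lin = -2.0 * m2 / s21 + 2.0 * m1 / s11
F = m2**2 / s21 - m1**2 / s11 - np.log((s11 * s12) / (s21 * s22))

# Step 1: set x[2] = 0 and solve the quadratic in x[1] -> horizontal extremities
delta = D_lin**2 - 4.0 * A * F
l_m, l_p = sorted(((-D_lin - np.sqrt(delta)) / (2 * A),
                   (-D_lin + np.sqrt(delta)) / (2 * A)))

# Step 2: the center lies midway between them, on the axis through the means
x_c = (l_m + l_p) / 2.0

# Step 3: set x[1] = x_c and solve for x[2] -> vertical extremities
h = np.sqrt(-(A * x_c**2 + D_lin * x_c + F) / C)
h_m, h_p = -h, h
print(l_m, l_p, x_c, h_m, h_p)
```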


About

This project was carried out as part of a course at the University of Geneva. It aims to propose an interactive visualization of the decision boundary between two Gaussian distributions, highlighting the intuition underlying Bayesian classification. We warmly thank our supervisor for his guidance and valuable support throughout the project.

Authors : Mohamed Zirar, Sofia Mae Padrigon, Ema Greganova

Supervisor : Stéphane Marchand-Maillet

License : MIT



Resources

If you would like to delve even deeper into the subject, we recommend the following book and websites:


