Maximum Likelihood Estimation (also known as MLE) is a technique for estimating the parameters of a probability distribution from observed data. You can think of it as taking a family of candidate distributions indexed by a parameter, writing down the joint density of the observed data under each candidate, and picking the parameter value whose distribution fits the observed data best.
Now let’s see how to set up an MLE for a generic density \(f(x)\) with \(n\) observed data points:
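As a sketch of the standard setup, assuming the \(n\) observations \(x_1, \ldots, x_n\) are independent draws from a density \(f(x; \theta)\) with parameter \(\theta\):

\[
L(\theta) = \prod_{i=1}^{n} f(x_i; \theta), \qquad \hat{\theta} = \underset{\theta \in \Omega}{\arg\max}\; L(\theta).
\]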
Usually this function can be all over the place and unwieldy, and if it is we can harness it with the power of the \(\ln\)!
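Since the log is monotone, maximizing \(\ln L\) is the same as maximizing \(L\), and (under the same i.i.d. assumption as above) the product becomes a much friendlier sum:

\[
\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i; \theta).
\]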
Now we can look at the first and second derivatives of \(L(\theta)\) or of \(\ln(L(\theta))\). The critical points of either first derivative tell us where the likelihood is maximized within the parameter space \(\Omega\).
Note that, by the second derivative test, we want the second derivative at the critical point to be less than zero to confirm it is a maximum.
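In symbols, the conditions we check at a candidate \(\hat{\theta}\) are:

\[
\left.\frac{\partial \ell}{\partial \theta}\right|_{\theta = \hat{\theta}} = 0, \qquad \left.\frac{\partial^2 \ell}{\partial \theta^2}\right|_{\theta = \hat{\theta}} < 0.
\]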
In the cases where we can’t get a closed-form solution for \(\theta\), we can rely on order statistics[1] instead. Take the following function for example.
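Judging from the constraints used below (\(\theta \leq X_i \leq \theta + 1\)), the function in question is the uniform density on \([\theta, \theta + 1]\):

\[
f(x; \theta) = 1, \quad \theta \leq x \leq \theta + 1,
\qquad
L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) = 1 \quad \text{for } x_{(n)} - 1 \leq \theta \leq x_{(1)} \ (\text{and } 0 \text{ otherwise}).
\]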
Taking the first derivative of the log-likelihood gives zero for every \(\theta\) in the allowed range, which means the second derivative is zero as well, so the second derivative test tells us nothing.
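Written out, under that assumed uniform density the log-likelihood is completely flat wherever it is nonzero:

\[
\ell(\theta) = \ln L(\theta) = 0
\quad\Longrightarrow\quad
\frac{\partial \ell}{\partial \theta} = 0
\quad\text{and}\quad
\frac{\partial^2 \ell}{\partial \theta^2} = 0.
\]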
This does not mean that the MLE does not exist! It merely requires a different strategy to poke at the function and massage the \(\theta\) we need out of it.
What else do we know about this distribution? Well, we know that \(\theta \leq X_1, X_2, \ldots, X_n\) and also that \(X_1, X_2, \ldots, X_n \leq \theta + 1\). In terms of the order statistics, this implies that \(\hat{\theta} \leq X_{(1)}\)[1] and that \(X_{(n)} \leq \hat{\theta} + 1 \Rightarrow X_{(n)} - 1 \leq \hat{\theta}\).
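Combining the two bounds leads us to

\[
X_{(n)} - 1 \leq \hat{\theta} \leq X_{(1)}.
\]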
This gives us our range for \(\hat{\theta}\): any value in this interval maximizes the likelihood, so the MLE exists but is not unique.
Next, let’s look at a more complex function:
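Going by the discussion below and the linked MathOverflow posts[2][4], this appears to be the shifted exponential density

\[
f(x; \theta) = e^{-(x - \theta)}, \qquad x > \theta.
\]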
Our likelihood function here is:
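Under that assumed form, with \(n\) independent observations:

\[
L(\theta) = \prod_{i=1}^{n} e^{-(x_i - \theta)} = e^{\,n\theta - \sum_{i=1}^{n} x_i} \quad \text{for } \theta < x_{(1)} \ (\text{and } 0 \text{ otherwise}),
\qquad
\ell(\theta) = n\theta - \sum_{i=1}^{n} x_i.
\]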
To maximize \(\ell\) we cannot simply look at the derivatives (as the first derivative is 0 only when \(n=0\)), so we need to look at the order statistic:
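A sketch of why, still under the assumed density: the log-likelihood is strictly increasing in \(\theta\), so we want \(\theta\) as large as the support constraint allows:

\[
\frac{\partial \ell}{\partial \theta} = n > 0
\quad\Longrightarrow\quad
\ell(\theta) \text{ is increasing in } \theta, \quad \text{subject to } \theta < X_{(1)}.
\]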
This function has some interesting implications depending on how you view it. With the strict inequality \(x > \theta\) in the support, we would say that no MLE exists for this function, as \(\hat{\theta}\) must always stay strictly less than \(X_{(1)}\) while the likelihood keeps growing as it approaches that bound. If you instead define the support as \(x \geq \theta\), then the MLE is well defined and \(\hat{\theta} = X_{(1)}\).
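One way to see the two readings side by side, again under the assumed density: with the open support the supremum of the likelihood is never attained, while with the closed support it is attained at the smallest observation:

\[
\sup_{\theta < X_{(1)}} L(\theta) = e^{\,n X_{(1)} - \sum_i x_i} \ \text{(not attained)},
\qquad
\max_{\theta \leq X_{(1)}} L(\theta) = e^{\,n X_{(1)} - \sum_i x_i} \ \text{at } \hat{\theta} = X_{(1)}.
\]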
A post on MathOverflow[2], looking at it from the perspective of the Radon-Nikodym[3] theorem, argues: “[A] continuous random variable with the above density function will be defined at \(x=\theta\), you cannot declare, by fiat, that \(x\) must be strictly greater than \(\theta\) if \(x\) is a continuous random variable with a density defined on the points greater than theta. As an example, let’s say that we calculate the probability of \(x \geq \theta + \epsilon\) for \(\epsilon \gt 0\), then \(\lim_{\epsilon \to 0}P(x \geq \theta + \epsilon) = 1\), so the density will include \(\theta\) in its domain so that you have a proper probability measure.” This appears to be due to the definition of probability densities as Radon-Nikodym derivatives (of the form \(\frac{d\nu}{d\mu}\), where \(\nu, \mu\) are \(\sigma\)-finite measures on \((X, \Sigma)\) and our \(f\) is \(\Sigma\)-measurable).
Even further confusion comes from this MathOverflow[4] post, which, for the same function and without using Radon-Nikodym derivatives, concludes that indeed \(\hat{\theta} = \min X_i = X_{(1)}\), and then goes even further to construct an unbiased estimator for this density function.
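Presumably that estimator is something along these lines (a sketch, under the assumed density): \(X_{(1)} - \theta\) is exponential with rate \(n\), so \(E[X_{(1)}] = \theta + \tfrac{1}{n}\), and shifting the MLE down by \(\tfrac{1}{n}\) removes the bias:

\[
E\!\left[X_{(1)} - \tfrac{1}{n}\right] = \theta,
\]

so \(\tilde{\theta} = X_{(1)} - \tfrac{1}{n}\) would be unbiased for \(\theta\).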
This definitely makes it sound like there should in fact be an MLE, since this is a continuous density function. I still need to look further into this and try to formulate a proof either way, based on a metric-space argument, for this particular density. I will update this post when I have dug into it some more.
For now, if you have any comments about this function or anything about the MLE in particular, feel free to leave one below!
Helpful References