<a target="_blank" href="https://colab.research.google.com/github/umangsoni22/cs670-project/blob/assignment-2/assignment2/assignment2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Gaussian Maximum Likelihood

## MLE of a Gaussian $p_{model}(x|w)$

You are given an array of data points called `data`. Your course site plots the negative log-likelihood function for several candidate hypotheses. Estimate the parameters of the Gaussian $p_{model}$ by coding an implementation that finds its optimal parameters (15 points) and explaining what it does (10 points). You are free to use any gradient-based optimization method you like.

### Solution Explanation

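Since the dataset is small, we can compute the gradient over the whole batch at once. We run 5000 iterations, and in each one we update the parameters (mean and variance) by adding `learning_rate * batch_gradient`.

The update follows the standard gradient rule below. Here $L$ is the negative log-likelihood, so subtracting its gradient is the same as adding the gradient of the log-likelihood, which is what the code does: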

\begin{equation}
w_{k+1} := w_k - \eta \cdot \nabla_w L(w_k)
\end{equation}
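
To make the update rule concrete, here is a minimal sketch of repeated gradient steps on a toy one-dimensional loss. The helper name `gradient_step` and the quadratic loss are assumptions chosen only for illustration; they are not part of the assignment.

```python
import numpy as np

def gradient_step(w, grad_loss, eta):
    """One step of the rule w_{k+1} = w_k - eta * grad L(w_k)."""
    return w - eta * grad_loss(w)

# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
grad = lambda w: 2.0 * (w - 3.0)

w = np.array([0.0])
for _ in range(100):
    w = gradient_step(w, grad, eta=0.1)

print(w)  # moves toward the minimizer w = 3
```

When the objective is a log-likelihood to be maximized, the same rule applies with the sign flipped, which is why the solution code uses `+=`.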

The log-likelihood of the Gaussian distribution, given observations $x_1, \dots, x_n$ and parameters $\mu$ and $\sigma^2$, is:

\begin{equation}
\log L(\mu, \sigma^2 | x) = \sum_{i=1}^{n} \left[ -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]
\end{equation}

where $x_i$ are the observed values and the sum is over all observations.
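
The formula translates directly into NumPy. The sketch below uses a hypothetical helper named `gaussian_log_likelihood` and evaluates it on the assignment's data for two candidate hypotheses; the candidate values are arbitrary.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """Sum of per-point Gaussian log-densities with mean mu and variance sigma2."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2.0 * sigma2)

data = [4, 5, 7, 8, 8, 9, 10, 5, 2, 3, 5, 4, 8, 9]

# A hypothesis close to the bulk of the data should score higher
# than one that is far away from it.
print(gaussian_log_likelihood(data, mu=6.0, sigma2=6.0))
print(gaussian_log_likelihood(data, mu=1.0, sigma2=1.0))
```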

The partial derivatives of the log-likelihood function with respect to `μ` and `σ²` are:

1. With respect to `μ`:

\begin{equation}
\frac{\partial \log L(\mu, \sigma^2 | x)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)
\end{equation}

2. With respect to `σ²`:

\begin{equation}
\frac{\partial \log L(\mu, \sigma^2 | x)}{\partial \sigma^2} = \frac{1}{2\sigma^2}\left( -n + \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \right)
\end{equation}
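
A finite-difference check is a cheap way to confirm the sign and scale of analytic derivatives like these. The sketch below differentiates the log-likelihood numerically and compares the result with the closed-form expressions, written in the same form the solution code uses. The function names and the test point are assumptions made for illustration.

```python
import numpy as np

def log_likelihood(x, mu, sigma2):
    x = np.asarray(x, dtype=float)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2))

def grad_mu(x, mu, sigma2):
    x = np.asarray(x, dtype=float)
    return np.sum(x - mu) / sigma2

def grad_sigma2(x, mu, sigma2):
    x = np.asarray(x, dtype=float)
    return (-x.size + np.sum((x - mu) ** 2) / sigma2) / (2.0 * sigma2)

data = [4, 5, 7, 8, 8, 9, 10, 5, 2, 3, 5, 4, 8, 9]
mu0, s0, h = 2.0, 3.0, 1e-6  # arbitrary test point and step size

# Central differences in each parameter
num_mu = (log_likelihood(data, mu0 + h, s0) - log_likelihood(data, mu0 - h, s0)) / (2 * h)
num_s2 = (log_likelihood(data, mu0, s0 + h) - log_likelihood(data, mu0, s0 - h)) / (2 * h)

print(num_mu, grad_mu(data, mu0, s0))
print(num_s2, grad_sigma2(data, mu0, s0))
```

If the two numbers on each line agree to several digits, the analytic gradient is consistent with the log-likelihood.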


In [122]:
import numpy as np
data = [4, 5, 7, 8, 8, 9, 10, 5, 2, 3, 5, 4, 8, 9]

def ll_derivative_wrt_mean(x, mean, variance):
    """Calculate the derivative with respect to mean"""
    return np.sum([x_i - mean for x_i in x]) / variance

def ll_derivative_wrt_variance(x, mean, variance):
    """Calculate the derivative with respect to variance"""
    N = len(x)
    return 1 / (2 * variance) * ( -N + (1/variance)*(np.sum([(x_i - mean)**2 for x_i in x])))

n_epochs = 5000
# Decaying step size: starts at t0 / t1 = 0.01 and shrinks as t grows
t0, t1 = 50, 5000

def learning_rate(t):
    return t0 / (t + t1)

# Initial guesses for both parameters
mean = variance = 1
x = data

# Run n_epochs iterations of batch gradient ascent on the log-likelihood
for epoch in range(n_epochs):

    # Calculate the gradients for the whole batch together
    d_mean = ll_derivative_wrt_mean(x, mean, variance)
    d_variance = ll_derivative_wrt_variance(x, mean, variance)

    # Step along the gradient; "+=" because we maximize the log-likelihood
    mean += learning_rate(epoch) * d_mean
    variance += learning_rate(epoch) * d_variance

print(f"Optimal mean: {mean}, Optimal variance: {variance}")


Optimal mean: 6.214285714285691, Optimal variance: 5.881910450405248
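
As a sanity check, setting both derivatives to zero gives a closed-form answer: the sample mean and the biased sample variance (dividing by $n$, not $n-1$). A minimal sketch for comparing against the gradient-based result; the variable names `mu_hat` and `var_hat` are illustrative.

```python
import numpy as np

data = [4, 5, 7, 8, 8, 9, 10, 5, 2, 3, 5, 4, 8, 9]

mu_hat = np.mean(data)
var_hat = np.var(data)  # ddof=0 by default, i.e. the maximum-likelihood estimate

print(f"Closed-form mean: {mu_hat}, closed-form variance: {var_hat}")
```

The iterative estimates should land close to these values. A small remaining gap in the variance suggests the iteration has not fully converged.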


## MLE of a conditional Gaussian $p_{model}(y|x,w)$

You are given a problem that involves the relationship between $x$ and $y$. Estimate the parameters of a $p_{model}$ that fits the dataset $(x, y)$ shown below. You are free to use any gradient-based optimization method you like.


### Solution

The equations below are used to optimize the conditional Gaussian model, in which $y$ is Gaussian with mean $ax + b$ and variance $\sigma^2$.

The log-likelihood function for this Gaussian distribution, given a set of observations (x, y), and parameters a, b, and $\sigma^2$ is:

\begin{equation}
\log L(a, b, \sigma^2 | x, y) = \sum_{i=1}^{n} \left[ -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y_i - (a x_i + b))^2}{2\sigma^2} \right]
\end{equation}

where $y_i$ are the observed values, $x_i$ are the inputs, and the sum is over all observations.

The partial derivatives of the log-likelihood function with respect to `a`, `b`, and `σ²` are:

1. With respect to `a`:

\begin{equation}
\frac{\partial \log L}{\partial a} = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i \left( y_i - (a x_i + b) \right)
\end{equation}

2. With respect to `b`:

\begin{equation}
\frac{\partial \log L}{\partial b} = \frac{1}{\sigma^2}\sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)
\end{equation}

3. With respect to `σ²`:

\begin{equation}
\frac{\partial \log L}{\partial \sigma^2} = \frac{1}{2\sigma^2}\left( -n + \frac{1}{\sigma^2}\sum_{i=1}^{n}\left( y_i - (a x_i + b) \right)^2 \right)
\end{equation}
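
The three derivatives can be collected into a single vectorized gradient and checked against a numerical approximation. This is a sketch under stated assumptions: the helper names `gradient` and `numerical_gradient` and the test point `p0` are hypothetical, and the data is the assignment's $(x, y)$ dataset.

```python
import numpy as np

x = np.array([8, 16, 22, 33, 50, 51], dtype=float)
y = np.array([5, 20, 14, 32, 42, 58], dtype=float)

def log_likelihood(params):
    a, b, sigma2 = params
    resid = y - (a * x + b)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma2) - resid ** 2 / (2.0 * sigma2))

def gradient(params):
    """Analytic gradient (d/da, d/db, d/dsigma2) of the log-likelihood."""
    a, b, sigma2 = params
    resid = y - (a * x + b)
    n = x.size
    return np.array([
        np.sum(x * resid) / sigma2,
        np.sum(resid) / sigma2,
        (-n + np.sum(resid ** 2) / sigma2) / (2.0 * sigma2),
    ])

def numerical_gradient(f, params, h=1e-6):
    """Central-difference approximation, one coordinate at a time."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = h
        grad[i] = (f(params + step) - f(params - step)) / (2.0 * h)
    return grad

p0 = np.array([1.0, 1.0, 30.0])  # arbitrary test point (a, b, sigma2)
print(gradient(p0))
print(numerical_gradient(log_likelihood, p0))
```

Agreement between the two printed vectors indicates that the signs and scaling of the analytic derivatives match the log-likelihood.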

In [123]:
import numpy as np

# Given data
x = np.array([8, 16, 22, 33, 50, 51])
y = np.array([5, 20, 14, 32, 42, 58])

# Initial values
a = b = variance = 1
# Set learning rate and number of epochs
learning_rate = 0.01
epochs = 50000

# Define derivative functions
def derivative_a(x, y, a, b, variance):
    """Derivative of the log-likelihood with respect to the slope a"""
    t_sum = 0
    for i in range(len(x)):
        t_sum += x[i] * (y[i] - (a * x[i] + b))
    return t_sum / variance

def derivative_b(x, y, a, b, variance):
    """Derivative of the log-likelihood with respect to the intercept b"""
    t_sum = 0
    for i in range(len(x)):
        t_sum += y[i] - (a * x[i] + b)
    return t_sum / variance

def derivative_variance(x, y, a, b, variance):
    """Derivative of the log-likelihood with respect to the variance"""
    t_sum = 0
    N = len(x)
    for i in range(len(x)):
        t_sum += np.power(y[i] - (a * x[i] + b), 2)
    return (1 / (2 * variance)) * (-N + (1 / variance) * t_sum)

# Batch gradient ascent on the log-likelihood
for _ in range(epochs):
    # Parameters are updated one after another, so each update
    # already sees the values changed earlier in the same iteration
    variance += learning_rate * derivative_variance(x, y, a, b, variance)
    a += learning_rate * derivative_a(x, y, a, b, variance)
    b += learning_rate * derivative_b(x, y, a, b, variance)

    # Ensure variance remains positive
    variance = max(variance, 1e-6)

print(f"Optimal a: {a}, Optimal b: {b}, Optimal variance: {variance}")


Optimal a: 1.0366003850912346, Optimal b: -2.5915990246152423, Optimal variance: 34.65335537389535
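
For a Gaussian with a linear mean, maximizing the log-likelihood in $a$ and $b$ is equivalent to ordinary least squares, and the maximum-likelihood variance is the mean squared residual. The sketch below computes that closed-form answer with `np.polyfit` for comparison; the names `a_hat`, `b_hat`, and `sigma2_hat` are illustrative.

```python
import numpy as np

x = np.array([8, 16, 22, 33, 50, 51], dtype=float)
y = np.array([5, 20, 14, 32, 42, 58], dtype=float)

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
a_hat, b_hat = np.polyfit(x, y, 1)

# The maximum-likelihood variance divides by n, not n - 2.
resid = y - (a_hat * x + b_hat)
sigma2_hat = np.mean(resid ** 2)

print(f"Closed-form a: {a_hat}, b: {b_hat}, variance: {sigma2_hat}")
```

If the gradient-based variance differs noticeably from this value while $a$ and $b$ agree, that suggests the iteration has not fully converged or that the fixed step size is too large for the variance update to settle.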
