Introduction

When I took Machine Learning in college, I never felt that online resources covered the mathematical foundations to my taste. To try to fix that, in this post I’ll explore the mathematical foundations underlying neural networks and walk through my from-scratch implementation of binary classification on the moons dataset. We’ll dive deep into the calculus, linear algebra, and optimization theory that make these models work, culminating in a network that achieves 99.6% accuracy.

Mathematical Framework

Universal Approximation and Function Representation

Neural networks are universal function approximators. For our single hidden layer network, we’re constructing a function $f: \mathbb{R}^n \to \mathbb{R}$ of the form:

$$f(\mathbf{x}; \boldsymbol{\theta}) = \sum_{j=1}^{m} w_j^{(2)} \sigma\left(\mathbf{w}_j^{(1)T} \mathbf{x} + b_j^{(1)}\right) + b^{(2)}$$

Where:

  • $\mathbf{x} \in \mathbb{R}^n$ is the input vector
  • $m$ is the number of hidden units (150 in our implementation)
  • $\sigma(\cdot)$ is the activation function (ReLU)
  • $\boldsymbol{\theta} = \{W^{(1)}, \mathbf{b}^{(1)}, \mathbf{w}^{(2)}, b^{(2)}\}$ are the parameters
  • $W^{(1)} \in \mathbb{R}^{m \times n}$ is the first layer weight matrix
  • $\mathbf{w}^{(2)} \in \mathbb{R}^m$ is the output layer weight vector
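
To make these shapes concrete, here is a minimal parameter setup in NumPy using the same variable names that appear in the code later in this post (W_one, b_one, w, b); the small random initialization is just an illustrative choice, not necessarily the exact scheme I used.

import numpy as np

# Shapes of the parameters defined above (moons inputs are 2-D, m = 150 hidden units).
# Note: the 0.1 * randn initialization is an illustrative assumption.
n, m = 2, 150
W_one = 0.1 * np.random.randn(m, n)   # W^(1), shape (m, n)
b_one = np.zeros(m)                   # b^(1), shape (m,)
w = 0.1 * np.random.randn(m)          # w^(2), shape (m,)
b = 0.0                               # b^(2), scalar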

Forward Propagation Mathematics

The forward pass computes a composition of linear transformations and nonlinear activations:

$$\mathbf{a}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)} \in \mathbb{R}^m$$

$$\mathbf{h}^{(1)} = \text{ReLU}(\mathbf{a}^{(1)}) = \max(0, \mathbf{a}^{(1)}) \in \mathbb{R}^m$$

$$z = (\mathbf{w}^{(2)})^T\mathbf{h}^{(1)} + b^{(2)} \in \mathbb{R}$$

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \in (0,1)$$

The Rectified Linear Unit (ReLU) activation function introduces nonlinearity:

$$ \text{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$

Its derivative (taking $\text{ReLU}'(0) = 0$ by convention) is the Heaviside step function:

$$ \text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$
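
The gradient code later in this post calls ReLU, ReLUprime, and model helpers that are not shown there. Here is a minimal sketch of what they could look like, consistent with the equations above and with how they are called below (my actual implementations may differ slightly):

def ReLU(a):
    # elementwise max(0, a)
    return np.maximum(0, a)

def ReLUprime(a):
    # elementwise derivative of ReLU: 1 where a > 0, else 0
    return (a > 0).astype(float)

def model(w, b, W_one, b_one, X):
    # forward pass for all N rows of X: z = ReLU(X W^(1)T + b^(1)) w^(2) + b^(2)
    H = ReLU(X @ W_one.T + b_one)   # hidden activations, shape (N, m)
    return H @ w + b                # logits z, shape (N,)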

Loss Function and Numerical Stability

Binary Cross-Entropy Loss

For binary classification, we use the cross-entropy loss function. Given true labels $y_i \in \{0,1\}$ and predictions $\hat{y}_i \in (0,1)$:

$$\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$
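
For intuition, a confident correct prediction costs almost nothing while a confident wrong one is punished heavily: with $y = 1$, a prediction $\hat{y} = 0.9$ contributes $-\log(0.9) \approx 0.105$ to the sum, whereas $\hat{y} = 0.1$ contributes $-\log(0.1) \approx 2.303$.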

Substituting $\hat{y}_i = \sigma(z_i)$:

$$\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\sigma(z_i)) + (1-y_i) \log(1-\sigma(z_i)) \right]$$

Numerical Stability

The naive computation of these logarithms is numerically unstable at the extremes. When $z$ is large and negative, $e^{-z}$ overflows and the computed $\sigma(z)$ underflows to $0$, so $\log(\sigma(z))$ evaluates to $-\infty$. Symmetrically, when $z$ is large and positive, $\sigma(z)$ rounds to exactly $1$ and $\log(1-\sigma(z))$ evaluates to $\log(0) = -\infty$.

The mathematically equivalent but numerically stable formulations are:

For $\log(\sigma(z))$: $$\log(\sigma(z)) = \log\left(\frac{1}{1+e^{-z}}\right) = -\log(1+e^{-z})$$

However, evaluating $e^{-z}$ directly still overflows when $z$ is very negative, so we branch on the sign of $z$ and only ever exponentiate a non-positive argument:

$$ \log(\sigma(z)) = \begin{cases} z - \log(1 + e^{z}) & \text{if } z < 0 \\ -\log(1 + e^{-z}) & \text{if } z \geq 0 \end{cases} $$

For $\log(1-\sigma(z))$: $$\log(1-\sigma(z)) = \log\left(\frac{e^{-z}}{1+e^{-z}}\right) = -z - \log(1+e^{-z})$$

Similarly:

$$ \log(1-\sigma(z)) = \begin{cases} -\log(1 + e^{z}) & \text{if } z < 0 \\ -z - \log(1 + e^{-z}) & \text{if } z \geq 0 \end{cases} $$

This is implemented in my mylog and mylogminus functions.

import numpy as np

def sigmoid(t):
    return 1.0 / (1 + np.exp(-t))

def loss(y, z):
    # binary cross-entropy of sigmoid(z) against the labels y,
    # using the numerically stable log terms below
    # y --> labels (0 or 1)
    # z --> pre-sigmoid model outputs (logits)
    return np.mean(-y * mylog(z) - (1 - y) * mylogminus(z))

def mylog(t):
    # compute log(sigmoid(t)) to be used in loss, branching for stability
    y = 0 * t
    m = t.shape[0]
    for i in range(m):
        if t[i] < 0:
            y[i] = t[i] - np.log(1 + np.exp(t[i]))
        else:
            y[i] = -np.log(1 + np.exp(-t[i]))
    return y

def mylogminus(t):
    # compute log(1 - sigmoid(t)) to be used in loss, branching for stability
    y = 0 * t
    m = t.shape[0]
    for i in range(m):
        if t[i] < 0:
            y[i] = -np.log(1 + np.exp(t[i]))
        else:
            y[i] = -t[i] - np.log(1 + np.exp(-t[i]))
    return y
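
As a quick sanity check (purely illustrative, assuming the functions above are already defined), the stable helpers stay finite exactly where the naive computation blows up:

z = np.array([-500.0, -5.0, 0.0, 5.0, 500.0])
naive = np.log(sigmoid(z))   # overflow warning; log(0) = -inf at z = -500
stable = mylog(z)            # returns -500.0 there, the correct value to machine precision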

Backpropagation Mathematics

Backpropagation applies the multivariable chain rule to compute gradients. For our loss function $\mathcal{L}(\boldsymbol{\theta})$, we need:

$$\frac{\partial \mathcal{L}}{\partial W^{(1)}}, \quad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}}, \quad \frac{\partial \mathcal{L}}{\partial \mathbf{w}^{(2)}}, \quad \frac{\partial \mathcal{L}}{\partial b^{(2)}}$$

Output Layer Gradients

Starting from the output, for a single sample:

$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z}$$

The cross-entropy loss derivative is: $$\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$$

The sigmoid derivative is: $$\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1-\sigma(z)) = \hat{y}(1-\hat{y})$$

Combining these and simplifying gives the remarkably clean result: $$\frac{\partial \mathcal{L}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = -y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y$$

This is the fundamental gradient that drives learning in logistic regression and neural networks.

For the output layer parameters: $$\frac{\partial z}{\partial \mathbf{w}^{(2)}} = \mathbf{h}^{(1)}, \quad \frac{\partial z}{\partial b^{(2)}} = 1$$

Therefore: $$\frac{\partial \mathcal{L}}{\partial \mathbf{w}^{(2)}} = (\hat{y} - y)\mathbf{h}^{(1)}, \quad \frac{\partial \mathcal{L}}{\partial b^{(2)}} = \hat{y} - y$$

Hidden Layer Gradients

For the hidden layer, we apply the chain rule through the ReLU activation:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(1)}} = \frac{\partial \mathcal{L}}{\partial z} \frac{\partial z}{\partial \mathbf{h}^{(1)}} = (\hat{y} - y)\mathbf{w}^{(2)}$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(1)}} \odot \frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{a}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(1)}} \odot \text{ReLU}'(\mathbf{a}^{(1)})$$

Where $\odot$ denotes element-wise multiplication.

Finally: $$\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(1)}} \mathbf{x}^T$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(1)}}$$

Implementation Details

def gradients(w, b, W_one, b_one, X, y):
    N = X.shape[0]
    # Initialize gradients
    dw = np.zeros(w.shape)
    db = 0
    dW_one = np.zeros(W_one.shape)
    db_one = np.zeros_like(b_one)

    # Forward pass
    z = model(w, b, W_one, b_one, X)

    # Backward pass for each sample
    for i in range(N):
        x_i = X[i].reshape(1, -1)
        y_i = y[i]

        # Forward pass for sample i (to get intermediate values)
        a_one_i = x_i @ W_one.T + b_one
        h_one_i = ReLU(a_one_i)
        sig_i = sigmoid(z[i])

        # Output layer: ∂L/∂z = σ(z) - y
        dz_i = sig_i - y_i

        # Output layer gradients: ∂L/∂w = (σ(z) - y) * h, ∂L/∂b = σ(z) - y
        dw += dz_i * h_one_i[0]
        db += dz_i

        # Hidden layer: ∂L/∂h = (σ(z) - y) * w
        dh_one_i = dz_i * w
        # ∂L/∂a = ∂L/∂h ⊙ ReLU'(a)
        da_one_i = dh_one_i * ReLUprime(a_one_i)

        # Hidden layer gradients: ∂L/∂W = ∂L/∂a * x^T, ∂L/∂b = ∂L/∂a
        dW_one += np.outer(da_one_i[0], x_i[0])
        db_one += da_one_i[0]

    # Average over batch
    return z, dw/N, db/N, dW_one/N, db_one/N
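
A standard way to validate a hand-written backward pass is a finite-difference check. Here is a sketch for the scalar output bias $b^{(2)}$, assuming the functions above plus some already-initialized w, b, W_one, b_one, X, y; the other parameters can be checked the same way, one entry at a time.

eps = 1e-5
z, dw, db, dW_one, db_one = gradients(w, b, W_one, b_one, X, y)

# Numerical derivative of the averaged loss with respect to b^(2).
loss_plus = loss(y, model(w, b + eps, W_one, b_one, X))
loss_minus = loss(y, model(w, b - eps, W_one, b_one, X))
numeric_db = (loss_plus - loss_minus) / (2 * eps)
print(db, numeric_db)   # should agree to several decimal places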

Optimization Theory

Gradient Descent Algorithm

We minimize the loss using gradient descent:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})$$

Where $\eta$ is the learning rate. The convergence properties depend on the loss landscape and the choice of $\eta$.
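
In code, this update rule becomes a simple loop over the gradients function from earlier (the learning rate and iteration count here are illustrative placeholders, not the exact values behind the results reported below):

eta = 0.1
for t in range(5000):
    z, dw, db, dW_one, db_one = gradients(w, b, W_one, b_one, X, y)
    w = w - eta * dw              # update output weights w^(2)
    b = b - eta * db              # update output bias b^(2)
    W_one = W_one - eta * dW_one  # update hidden weights W^(1)
    b_one = b_one - eta * db_one  # update hidden biases b^(1)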

Convergence

For convex objectives, gradient descent with an appropriate learning rate $\eta \leq \frac{1}{L}$ (where $L$ is the Lipschitz constant of the gradient) is guaranteed to converge to a global minimum. Neural network loss surfaces are non-convex, but empirically gradient descent still finds good local minima.

The learning rate requires careful tuning:

  • Too large: $\eta > \frac{2}{L}$ can cause divergence
  • Too small: Very slow convergence
  • Just right: Fast convergence to a good local minimum

Theoretical Foundation: Why This Works

Representation Power

The key insight is that ReLU networks create piecewise-linear decision boundaries. With $m$ hidden units, we can create up to $m$ “kinks” in our decision boundary. Each hidden unit contributes a half-space:

$$h_j = \max(0, \mathbf{w}_j^T\mathbf{x} + b_j)$$

The final output is a weighted combination of these half-spaces, allowing complex decision boundaries.
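
A one-dimensional example makes this concrete: three ReLU units with suitably chosen weights already produce a localized “bump”,

$$\max(0, x) - 2\max(0, x-1) + \max(0, x-2)$$

which is zero outside $[0,2]$ and rises linearly to a peak of $1$ at $x = 1$. Summing many such pieces lets the network trace out essentially arbitrary piecewise-linear shapes.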

Approximation Theory

The universal approximation theorem states that a feedforward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^n$ to arbitrary accuracy, provided the activation function is non-constant, bounded, and monotonically increasing.

ReLU is unbounded, so it does not satisfy those exact hypotheses, but an analogous result holds: ReLU networks can represent any continuous piecewise-linear function exactly, and can approximate continuous functions to arbitrary accuracy given sufficient width.

Results on Moons Dataset

Dataset Characteristics

The moons dataset presents a classic non-linearly separable problem. I used it before when I did logistic regression from scratch. The decision boundary between the two interlocked crescents is a curve, so the network has to learn a nonlinear mapping.
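
For reference, the dataset comes from scikit-learn; something like the following generates and splits it (the sample count, noise level, and split shown here are assumptions, not necessarily the settings I used):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Two interlocking half-circles with a little Gaussian noise.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)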

Performance Metrics

  • Training Accuracy: 99.6%
  • Test Accuracy: 99.6%
  • Final Loss: < 0.1
  • Convergence: Achieved within reasonable iterations

Analysis

The excellent test performance (matching training performance) indicates:

  1. Sufficient Model Capacity: 150 hidden units provide enough expressiveness
  2. Good Generalization: No overfitting despite model complexity
  3. Effective Optimization: Gradient descent successfully navigates the loss landscape

Code

As always, the code is available on my GitHub. You are free to do whatever you like with my implementation.