Lecture 16: Optimization II — Newton’s Method, BFGS, and Conjugate Gradient#

Beyond Gradient Descent: Using Curvature#

In Lecture 15, we built gradient descent optimizers — they only use the first derivative (gradient) to decide which direction to step.

The key limitation: gradient descent doesn’t know the shape of the landscape. Is the valley wide and flat, or narrow and steep? A method that uses curvature (second derivative) can adapt its step to the local geometry.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize as sp_minimize

I. Newton’s Method: Quadratic Convergence#

The Idea: Fit a Quadratic, Jump to Its Minimum#

Gradient descent uses a linear approximation of \(f\) at the current point. Newton’s method uses a quadratic approximation (Taylor expansion to second order):

\[f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^T \, \mathbf{H} \, \Delta\mathbf{x}\]

where \(\mathbf{H}\) is the Hessian matrix of second partial derivatives:

\[H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\]

Setting the gradient of the quadratic model to zero gives the Newton step:

\[\Delta\mathbf{x} = -\mathbf{H}^{-1} \nabla f(\mathbf{x})\]

For a quadratic function, Newton’s method converges in one step — it jumps directly to the minimum!

For general functions, convergence is quadratic: the error squares at each step (\(\epsilon_{n+1} \propto \epsilon_n^2\)). Compare this to gradient descent’s linear convergence (\(\epsilon_{n+1} \propto c \cdot \epsilon_n\)).

1D Newton’s Method: Root Finding Meets Optimization#

In 1D, the Hessian is just \(f''(x)\), and the Newton step becomes:

\[x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}\]

This is equivalent to finding the root of the derivative — exactly Newton-Raphson applied to \(f'(x) = 0\).

# Reuse our test function from Lecture 15
A, B, C = 1.2, 3.0, 0.6

def f(x):
    return A * x**3 + B * x**2 + C

def f_prime(x):
    return 3 * A * x**2 + 2 * B * x

def f_double_prime(x):
    return 6 * A * x + 2 * B

# 1D Newton's method
def newton_1d(x0, N=50, tol=1e-10):
    x = x0
    x_hist = [x]
    for i in range(N):
        ## code ...
        fp = f_prime(x)
        fpp = f_double_prime(x)
        if abs(fpp) < 1e-12:
            break
        dx = -fp/fpp
        x += dx
        x_hist.append(x)
        if abs(dx) < tol:
            break

    return np.array(x_hist)

# Compare: gradient descent vs Newton
def gradient_descent_1d(x0, eta=0.01, N=200, tol=1e-10):
    x = x0
    x_hist = [x]
    for i in range(N):
        dx = -eta * f_prime(x)
        x = x + dx
        x_hist.append(x)
        if abs(dx) < tol:
            break
    return np.array(x_hist)

x0 = 2.0
xh_gd = gradient_descent_1d(x0, eta=0.01)
xh_newton = newton_1d(x0)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: trajectories
xp = np.linspace(-3, 3, 300)
axes[0].plot(xp, f(xp), 'k-', lw=2)
axes[0].plot(xh_gd, f(xh_gd), 'bo-', ms=4, label=f'GD ({len(xh_gd)} steps)')
axes[0].plot(xh_newton, f(xh_newton), 'rs-', ms=6, label=f'Newton ({len(xh_newton)} steps)')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_title('Gradient Descent vs Newton')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Right: convergence
axes[1].semilogy(range(len(xh_gd)), np.abs(f_prime(xh_gd)), 'bo-', ms=3, label='Gradient Descent')
axes[1].semilogy(range(len(xh_newton)), np.abs(f_prime(xh_newton)), 'rs-', ms=5, label="Newton's Method")
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel("|f'(x)|")
axes[1].set_title('Convergence Rate')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f'Gradient descent: {len(xh_gd)-1} steps, x* = {xh_gd[-1]:.8f}')
print(f"Newton's method:  {len(xh_newton)-1} steps, x* = {xh_newton[-1]:.8f}")
_images/e1799814f08af93fd442e4cafe33fb948d9a659ababc3b4dfd83a6c96fe3952b.png
Gradient descent: 200 steps, x* = 0.00000365
Newton's method:  7 steps, x* = 0.00000000

Extending to 2D: The Hessian Matrix#

For a function \(f(x, y)\), the Hessian is:

\[\begin{split}\mathbf{H} = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix}\end{split}\]

We compute it numerically using central differences.

# Test functions from Lecture 15
def f1(x):
    return x[0]**2 / 2 + x[1]**2 / 3 - x[0] * x[1] / 4

def f2(x):
    return x[0]**2 / 2 + x[1]**2 / 3 - x[0] * x[1] / 4 + 3 * np.exp(-x[0]**2)

# Numerical gradient (central difference)
def derivative2(f, xy, d=0.001):
    x, y = xy[0], xy[1]
    fx = (f([x + d/2, y]) - f([x - d/2, y])) / d
    fy = (f([x, y + d/2]) - f([x, y - d/2])) / d
    return np.array([fx, fy])

# Numerical Hessian (central difference)
def hessian2(f, xy, d=0.001):
    x, y = xy[0], xy[1]
    fxx = (f([x+d, y]) - 2*f([x, y]) + f([x-d, y])) / d**2
    fyy = (f([x, y+d]) - 2*f([x, y]) + f([x, y-d])) / d**2
    fxy = (f([x+d/2, y+d/2]) - f([x+d/2, y-d/2])
         - f([x-d/2, y+d/2]) + f([x-d/2, y-d/2])) / d**2
    return np.array([[fxx, fxy],
                     [fxy, fyy]])

# Grid for contour plots
x_min, x_max = -4, 4
nx = np.linspace(x_min, x_max, 400)
x, y = np.meshgrid(nx, nx)
def minimize_newton(f, x0, N=100, tol=1e-8):
    """2D Newton's method using numerical Hessian."""
    x_now = np.array(x0, dtype=float)
    x_hist = [x_now.copy()]

    for i in range(N):
        grad = derivative2(f, x_now)
        H = hessian2(f, x_now)

        # Newton step: dx = -H^{-1} grad
        try:
            dx = np.linalg.solve(H, -grad)
        except np.linalg.LinAlgError:
            break  # Singular Hessian

        x_next = x_now + dx
        x_hist.append(x_next.copy())

        if np.linalg.norm(dx) < tol:
            break

        x_now = x_next

    return True, np.array(x_hist), f(x_now)
# Newton on f1 (nearly quadratic) -- expect very fast convergence
conv, xh_n, fm = minimize_newton(f1, [3.5, 3.5])

z1 = f1([x, y])
levels1 = np.arange(0, 12, 0.3)

fig, ax = plt.subplots(figsize=(6, 5))
ax.contour(x, y, z1, levels=levels1, cmap='viridis', alpha=0.6)
ax.plot(xh_n[:, 0], xh_n[:, 1], 'ro-', ms=6, lw=2)
ax.plot(xh_n[0, 0], xh_n[0, 1], 'go', ms=10, label='Start')
ax.plot(xh_n[-1, 0], xh_n[-1, 1], 'r*', ms=15, label='End')
ax.set_title(f"Newton's method on f1: {len(xh_n)-1} steps!")
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f'Steps: {len(xh_n)-1}')
print(f'Minimum at: ({xh_n[-1,0]:.8f}, {xh_n[-1,1]:.8f})')
print(f'f_min = {fm:.10f}')
_images/eeb546cf112d81ce49346b4cd369431fb00171bb06f57e7c9a52bba5d681152a.png
Steps: 3
Minimum at: (0.00000000, 0.00000000)
f_min = 0.0000000000

When Newton’s Method Fails#

Newton’s method is powerful but has important limitations:

  1. Non-convex functions: If the Hessian has negative eigenvalues, the Newton step may go to a maximum or saddle point instead of a minimum

  2. Far from minimum: The quadratic approximation may be poor, causing the step to overshoot wildly

  3. Singular Hessian: At inflection points, \(\det(\mathbf{H}) = 0\) and the method breaks down

A practical fix: damped Newton — take only a fraction of the Newton step: $\(\mathbf{x}_{n+1} = \mathbf{x}_n + \alpha \, \Delta\mathbf{x}_\text{Newton}, \quad 0 < \alpha \leq 1\)$

# Newton on f2 from different starting points
start_points = [[3.5, 3.5], [3.5, -3.5], [-3.5, 3.5], [-3.5, -3.5]]

z2 = f2([x, y])
levels2 = np.arange(np.min(z2), 12, 0.3)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Newton
axes[0].contour(x, y, z2, levels=levels2, cmap='viridis', alpha=0.6)
for sp in start_points:
    conv, xh, fm = minimize_newton(f2, sp)
    axes[0].plot(xh[:, 0], xh[:, 1], 'o-', ms=4, lw=1.5)
    axes[0].plot(sp[0], sp[1], 'go', ms=6)
axes[0].set_title("Newton's method on f2")
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].grid(alpha=0.3)

# Damped Newton
def minimize_damped_newton(f, x0, alpha=0.5, N=200, tol=1e-8):
    x_now = np.array(x0, dtype=float)
    x_hist = [x_now.copy()]
    for i in range(N):
        grad = derivative2(f, x_now)
        H = hessian2(f, x_now)
        try:
            dx = np.linalg.solve(H, -grad)
        except np.linalg.LinAlgError:
            break
        x_next = x_now + alpha * dx
        x_hist.append(x_next.copy())
        if np.linalg.norm(alpha * dx) < tol:
            break
        x_now = x_next
    return True, np.array(x_hist), f(x_now)

axes[1].contour(x, y, z2, levels=levels2, cmap='viridis', alpha=0.6)
for sp in start_points:
    conv, xh, fm = minimize_damped_newton(f2, sp, alpha=0.5)
    axes[1].plot(xh[:, 0], xh[:, 1], 'o-', ms=4, lw=1.5)
    axes[1].plot(sp[0], sp[1], 'go', ms=6)
axes[1].set_title("Damped Newton (\u03b1=0.5) on f2")
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
_images/aa7d99b392c8577a0e69a85a9188db7248f1902fc0ff38b70cd4fd01454dfcf7.png

Cost Analysis#

Gradient Descent

Newton’s Method

Per step

1 gradient evaluation (\(n\) function calls)

1 gradient + 1 Hessian (\(n + n^2\) calls) + solve \(n \times n\) system

Convergence

Linear: \(\epsilon_{n+1} \approx c \cdot \epsilon_n\)

Quadratic: \(\epsilon_{n+1} \approx C \cdot \epsilon_n^2\)

Total steps

Many (100s–1000s)

Few (5–20)

Scaling

\(O(n)\) per step

\(O(n^3)\) per step (matrix solve)

For small problems (\(n \lesssim 100\)), Newton’s method wins easily.
For large problems (\(n \gg 1000\)), the \(O(n^3)\) cost per step becomes prohibitive.

Can we get Newton-like convergence without computing the full Hessian? → Yes! That’s the idea behind BFGS and conjugate gradient.

II. Conjugate Gradient: Smart Search Directions#

The Problem with Gradient Descent#

Look at the path gradient descent takes on an elliptical valley — it zigzags! Each step is perpendicular to the previous one (the new gradient is orthogonal to the old step), so it keeps revisiting the same directions.

Conjugate gradient (CG) fixes this by choosing search directions that are conjugate (or \(\mathbf{H}\)-orthogonal):

\[\mathbf{d}_i^T \mathbf{H} \, \mathbf{d}_j = 0 \quad \text{for } i \neq j\]

This ensures each direction explores a new dimension of the landscape. For a quadratic function in \(n\) dimensions, CG converges in exactly \(n\) steps!

The Algorithm (Fletcher-Reeves)#

  1. Start: \(\mathbf{d}_0 = -\nabla f(\mathbf{x}_0)\) (steepest descent direction)

  2. Line search: find \(\alpha_k\) that minimizes \(f(\mathbf{x}_k + \alpha_k \mathbf{d}_k)\)

  3. Update: \(\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k\)

  4. New direction: \(\mathbf{d}_{k+1} = -\nabla f(\mathbf{x}_{k+1}) + \beta_k \mathbf{d}_k\)

where \(\beta_k = \frac{\|\nabla f(\mathbf{x}_{k+1})\|^2}{\|\nabla f(\mathbf{x}_k)\|^2}\) (Fletcher-Reeves formula)

The \(\beta_k\) term adds a fraction of the previous direction — this prevents the zigzagging.

def line_search_golden(f_1d, a=-1, b=1, tol=1e-6):
    """1D golden section search for line search."""
    phi = (np.sqrt(5) - 1) / 2
    x1 = b - phi * (b - a)
    x2 = a + phi * (b - a)
    f1, f2_val = f_1d(x1), f_1d(x2)
    for _ in range(100):
        if (b - a) < tol:
            break
        if f1 < f2_val:
            b = x2
            x2, f2_val = x1, f1
            x1 = b - phi * (b - a)
            f1 = f_1d(x1)
        else:
            a = x1
            x1, f1 = x2, f2_val
            x2 = a + phi * (b - a)
            f2_val = f_1d(x2)
    return (a + b) / 2

def minimize_cg(f, x0, N=200, tol=1e-8):
    """Conjugate gradient with Fletcher-Reeves formula."""
    x_now = np.array(x0, dtype=float)
    grad = derivative2(f, x_now)
    d = -grad.copy()  # Initial direction = steepest descent
    x_hist = [x_now.copy()]

    for k in range(N):
        # Line search along direction d
        f_1d = lambda alpha, xn=x_now, dn=d: f(xn + alpha * dn)
        alpha = line_search_golden(f_1d, a=0, b=2.0)

        x_next = x_now + alpha * d
        x_hist.append(x_next.copy())

        if np.linalg.norm(x_next - x_now) < tol:
            break

        ## code .....
        grad_new = derivative2(f, x_next)
        # fprime= np.dot(grad,grad)
        # if  fprime < 1e-12:  fprime = 1e-12

        beta = np.dot(grad_new, grad_new) / max(np.dot(grad, grad), 1e-12)
        d = -grad_new + beta * d
        grad = grad_new
        x_now = x_next

    return True, np.array(x_hist), f(x_now)
# Compare gradient descent vs conjugate gradient on f1
def minimize_fix(f, x0, dx=0.05, N=1000):
    x_now = np.array(x0, dtype=float)
    x_hist = [x_now.copy()]
    for i in range(N):
        dfx = derivative2(f, x_now)
        x_next = x_now - dx * dfx
        if np.linalg.norm(x_next - x_now) < 1e-6:
            break
        x_now = x_next
        x_hist.append(x_now.copy())
    return True, np.array(x_hist), f(x_now)

sp = [3.5, 3.5]
_, xh_gd, _ = minimize_fix(f1, sp)
_, xh_cg, _ = minimize_cg(f1, sp)

z1 = f1([x, y])
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].contour(x, y, z1, levels=np.arange(0, 12, 0.3), cmap='viridis', alpha=0.6)
axes[0].plot(xh_gd[:, 0], xh_gd[:, 1], 'ro-', ms=2, lw=1)
axes[0].plot(sp[0], sp[1], 'go', ms=8)
axes[0].set_title(f'Gradient Descent: {len(xh_gd)-1} steps')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].grid(alpha=0.3)

axes[1].contour(x, y, z1, levels=np.arange(0, 12, 0.3), cmap='viridis', alpha=0.6)
axes[1].plot(xh_cg[:, 0], xh_cg[:, 1], 'ro-', ms=4, lw=1.5)
axes[1].plot(sp[0], sp[1], 'go', ms=8)
axes[1].set_title(f'Conjugate Gradient: {len(xh_cg)-1} steps')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].grid(alpha=0.3)

plt.suptitle('Zigzagging (GD) vs Straight Path (CG) on f1', fontsize=12)
plt.tight_layout()
plt.show()

print(f'Gradient descent: {len(xh_gd)-1} steps')
print(f'Conjugate gradient: {len(xh_cg)-1} steps')
_images/7d5d9eee5f95620cbaa944b761267dfd5102fb2b7e895cb0c5ffa03cd9f445f3.png
Gradient descent: 435 steps
Conjugate gradient: 4 steps

III. BFGS: The Best of Both Worlds#

Quasi-Newton Methods: Approximate the Hessian#

Newton’s method needs the exact Hessian (\(O(n^2)\) second derivatives). Quasi-Newton methods build an approximation to the Hessian (or its inverse) using only gradient information — updating it step by step.

BFGS (Broyden–Fletcher–Goldfarb–Shanno) is the most successful quasi-Newton method:

  1. Start with \(\mathbf{B}_0 = \mathbf{I}\) (identity matrix — pretend the Hessian is the identity)

  2. Compute search direction: \(\mathbf{d}_k = -\mathbf{B}_k^{-1} \nabla f(\mathbf{x}_k)\)

  3. Line search: find step size \(\alpha_k\)

  4. Update position: \(\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k\)

  5. Update the Hessian approximation using the secant condition:

\[\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{\mathbf{y}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{s}_k} - \frac{\mathbf{B}_k \mathbf{s}_k \mathbf{s}_k^T \mathbf{B}_k}{\mathbf{s}_k^T \mathbf{B}_k \mathbf{s}_k}\]

where \(\mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k\) and \(\mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)\).

Key insight: BFGS “learns” the curvature from how the gradient changes — no second derivatives needed!

def minimize_bfgs(f, x0, N=200, tol=1e-8):
    """BFGS quasi-Newton method."""
    n = len(x0)
    x_now = np.array(x0, dtype=float)
    B = np.eye(n)  # Initial Hessian approximation = identity
    grad = derivative2(f, x_now)
    x_hist = [x_now.copy()]

    for k in range(N):
        # Search direction
        d = np.linalg.solve(B, -grad)

        # Line search along d
        f_1d = lambda alpha, xn=x_now, dn=d: f(xn + alpha * dn)
        alpha = line_search_golden(f_1d, a=0, b=2.0)

        s = alpha * d  # Step
        x_next = x_now + s
        x_hist.append(x_next.copy())

        if np.linalg.norm(s) < tol:
            break

        grad_new = derivative2(f, x_next)
        y = grad_new - grad  # Gradient change

        # BFGS update
        sy = np.dot(s, y)
        if abs(sy) > 1e-15:  # Skip update if s and y are nearly orthogonal
            Bs = B @ s
            B = B + np.outer(y, y) / sy - np.outer(Bs, Bs) / np.dot(s, Bs)

        x_now = x_next
        grad = grad_new

    return True, np.array(x_hist), f(x_now)
# BFGS on f1 and f2
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

z1 = f1([x, y])
axes[0].contour(x, y, z1, levels=np.arange(0, 12, 0.3), cmap='viridis', alpha=0.6)
for sp in [[3.5, 3.5], [3.5, -3.5], [-3.5, 3.5], [-3.5, -3.5]]:
    conv, xh, fm = minimize_bfgs(f1, sp)
    axes[0].plot(xh[:, 0], xh[:, 1], 'o-', ms=4, lw=1.5)
    axes[0].plot(sp[0], sp[1], 'go', ms=6)
axes[0].set_title('BFGS on f1')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].grid(alpha=0.3)

z2 = f2([x, y])
levels2 = np.arange(np.min(z2), 12, 0.3)
axes[1].contour(x, y, z2, levels=levels2, cmap='viridis', alpha=0.6)
for sp in [[3.5, 3.5], [3.5, -3.5], [-3.5, 3.5], [-3.5, -3.5]]:
    conv, xh, fm = minimize_bfgs(f2, sp)
    axes[1].plot(xh[:, 0], xh[:, 1], 'o-', ms=4, lw=1.5)
    axes[1].plot(sp[0], sp[1], 'go', ms=6)
axes[1].set_title('BFGS on f2')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
_images/997a82a5d0f94da1341871cbe386dd6d425b45e41bf3f453a73d1c36512210af.png

IV. Method Showdown: Comparing All Approaches#

# Variable step (Barzilai-Borwein) from Lecture 15
def minimize_var(f, x0, N=1000):
    x_now = np.array(x0, dtype=float)
    x_prev = None
    converged = False
    x_hist = [x_now.copy()]
    for i in range(N):
        df_now = derivative2(f, x_now)
        if x_prev is None:
            dx = 0.01
        else:
            df_prev = derivative2(f, x_prev)
            dd = df_now - df_prev
            dx = np.dot((x_now - x_prev), dd) / (np.linalg.norm(dd))**2
        x_next = x_now - df_now * dx
        if f(x_next) < f(x_now):
            x_prev = x_now
            x_now = x_next
            x_hist.append(x_now.copy())
        else:
            converged = True
            break
    return converged, np.array(x_hist), f(x_now)
# Grand comparison: all methods on f2
start_points = [[3.5, 3.5], [3.5, -3.5], [-3.5, 3.5], [-3.5, -3.5]]
methods = [
    ('Fixed Step', minimize_fix),
    ('Barzilai-Borwein', minimize_var),
    ('Newton', minimize_newton),
    ('Conjugate Grad', minimize_cg),
    ('BFGS', minimize_bfgs),
]

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes_flat = axes.flatten()

z2 = f2([x, y])
levels2 = np.arange(np.min(z2), 12, 0.3)

for col, (name, method) in enumerate(methods):
    ax = axes_flat[col]
    ax.contour(x, y, z2, levels=levels2, cmap='viridis', alpha=0.6)
    for sp in start_points:
        conv, xh, fm = method(f2, sp)
        ax.plot(xh[:, 0], xh[:, 1], 'o-', ms=2, lw=1)
        ax.plot(sp[0], sp[1], 'go', ms=5)
    ax.set_title(name)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.grid(alpha=0.3)

# scipy BFGS in last panel
axes_flat[5].contour(x, y, z2, levels=levels2, cmap='viridis', alpha=0.6)
for sp in start_points:
    traj = [sp]
    sp_minimize(f2, sp, method='BFGS', callback=lambda xk, sp=sp: traj.append(xk.tolist()), tol=1e-6)
    traj = np.array(traj)
    axes_flat[5].plot(traj[:, 0], traj[:, 1], 'o-', ms=2, lw=1)
    axes_flat[5].plot(sp[0], sp[1], 'go', ms=5)
axes_flat[5].set_title('scipy BFGS')
axes_flat[5].set_xlabel('x')
axes_flat[5].set_ylabel('y')
axes_flat[5].grid(alpha=0.3)

plt.suptitle('All Methods on f2 (4 Starting Points)', fontsize=13)
plt.tight_layout()
plt.show()
_images/fa7683a508a2de80c92f52e75f204c7cfd9e17224166ae013bd818f5b931af79.png
# Quantitative comparison
print('Method Comparison on f2 (two minima)')
print(f'{"Start":>14s} | {"Method":>16s} | {"Steps":>5s} | {"f_min":>8s} | {"Final (x, y)":>20s}')
print('-' * 76)

for sp in start_points:
    for name, method in methods:
        conv, xh, fm = method(f2, sp)
        end = xh[-1]
        print(f'({sp[0]:5.1f},{sp[1]:5.1f}) | {name:>16s} | {len(xh)-1:5d} | {fm:8.4f} | ({end[0]:8.4f}, {end[1]:8.4f})')
    # Also scipy BFGS
    res = sp_minimize(f2, sp, method='BFGS', tol=1e-6)
    print(f'({sp[0]:5.1f},{sp[1]:5.1f}) | {"scipy BFGS":>16s} | {res.nit:5d} | {res.fun:8.4f} | ({res.x[0]:8.4f}, {res.x[1]:8.4f})')
    print()
Method Comparison on f2 (two minima)
         Start |           Method | Steps |    f_min |         Final (x, y)
----------------------------------------------------------------------------
(  3.5,  3.5) |       Fixed Step |   354 |   1.3096 | (  1.3748,   0.5156)
(  3.5,  3.5) | Barzilai-Borwein |     8 |   1.3096 | (  1.3762,   0.5162)
(  3.5,  3.5) |           Newton |     4 |   3.0000 | (  0.0000,   0.0000)
(  3.5,  3.5) |   Conjugate Grad |    13 |   1.3096 | (  1.3748,   0.5156)
(  3.5,  3.5) |             BFGS |     6 |   1.3096 | (  1.3748,   0.5156)
(  3.5,  3.5) |       scipy BFGS |     8 |   1.3096 | (  1.3748,   0.5156)

(  3.5, -3.5) |       Fixed Step |   357 |   1.3096 | (  1.3748,   0.5155)
(  3.5, -3.5) | Barzilai-Borwein |     2 |   2.7599 | ( -0.3739,  -0.6540)
(  3.5, -3.5) |           Newton |     4 |   3.0000 | (  0.0000,  -0.0000)
(  3.5, -3.5) |   Conjugate Grad |    17 |   1.3096 | ( -1.3748,  -0.5156)
(  3.5, -3.5) |             BFGS |     7 |   1.3096 | ( -1.3748,  -0.5156)
(  3.5, -3.5) |       scipy BFGS |     7 |   1.3096 | (  1.3748,   0.5156)

( -3.5,  3.5) |       Fixed Step |   357 |   1.3096 | ( -1.3748,  -0.5155)
( -3.5,  3.5) | Barzilai-Borwein |     2 |   2.7599 | (  0.3739,   0.6540)
( -3.5,  3.5) |           Newton |     4 |   3.0000 | ( -0.0000,   0.0000)
( -3.5,  3.5) |   Conjugate Grad |    17 |   1.3096 | (  1.3748,   0.5156)
( -3.5,  3.5) |             BFGS |     7 |   1.3096 | (  1.3748,   0.5156)
( -3.5,  3.5) |       scipy BFGS |     7 |   1.3096 | ( -1.3748,  -0.5156)

( -3.5, -3.5) |       Fixed Step |   354 |   1.3096 | ( -1.3748,  -0.5156)
( -3.5, -3.5) | Barzilai-Borwein |     8 |   1.3096 | ( -1.3762,  -0.5162)
( -3.5, -3.5) |           Newton |     4 |   3.0000 | ( -0.0000,  -0.0000)
( -3.5, -3.5) |   Conjugate Grad |    13 |   1.3096 | ( -1.3748,  -0.5156)
( -3.5, -3.5) |             BFGS |     6 |   1.3096 | ( -1.3748,  -0.5156)
( -3.5, -3.5) |       scipy BFGS |     8 |   1.3096 | ( -1.3748,  -0.5156)

V. Constrained Optimization#

Optimization with Constraints#

Many physics problems have constraints:

  • Find the shape of a hanging chain (minimize potential energy, subject to fixed length)

  • Find the ground state energy (minimize \(\langle H \rangle\), subject to \(\langle\psi|\psi\rangle = 1\))

  • Find the equilibrium configuration (minimize free energy, subject to fixed particle number)

Mathematically: minimize \(f(\mathbf{x})\) subject to \(g(\mathbf{x}) = 0\)

Lagrange Multipliers#

The classical approach: at a constrained minimum, the gradient of \(f\) must be parallel to the gradient of \(g\):

\[\nabla f = \lambda \nabla g\]

This gives us the Lagrangian: \(\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda \, g(\mathbf{x})\)

Setting \(\nabla_{\mathbf{x}} \mathcal{L} = 0\) and \(\partial \mathcal{L}/\partial \lambda = 0\) gives a system of equations.

Penalty Method#

A simpler computational approach: convert the constrained problem to an unconstrained one by adding a penalty for violating the constraint:

\[\tilde{f}(\mathbf{x}) = f(\mathbf{x}) + \mu \, [g(\mathbf{x})]^2\]

As \(\mu \to \infty\), the minimum of \(\tilde{f}\) approaches the constrained minimum of \(f\).

# Example: minimize f(x,y) = x + y  subject to  x^2 + y^2 = 1
# Find the point on the unit circle closest to the direction (-1,-1)
# Analytical answer: x* = y* = -1/sqrt(2)

def f_obj(xy):
    return xy[0] + xy[1]

def g_constraint(xy):
    return xy[0]**2 + xy[1]**2 - 1

# Penalty method: minimize f + mu * g^2
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

penalties = [1, 10, 100]
theta = np.linspace(0, 2*np.pi, 200)

for idx, mu in enumerate(penalties):
    def f_penalty(xy, mu=mu):
        return f_obj(xy) + mu * g_constraint(xy)**2

    conv, xh, fm = minimize_bfgs(f_penalty, [0.5, 0.8])

    xg = np.linspace(-2, 2, 200)
    xg, yg = np.meshgrid(xg, xg)
    zp = f_obj([xg, yg]) + mu * (xg**2 + yg**2 - 1)**2
    axes[idx].contour(xg, yg, zp, levels=30, cmap='viridis', alpha=0.6)
    axes[idx].plot(np.cos(theta), np.sin(theta), 'k-', lw=2, label='Constraint')
    axes[idx].plot(xh[:, 0], xh[:, 1], 'ro-', ms=3, lw=1)
    axes[idx].plot(xh[-1, 0], xh[-1, 1], 'r*', ms=12)
    axes[idx].set_title(f'\u03bc = {mu}\n({xh[-1,0]:.3f}, {xh[-1,1]:.3f})')
    axes[idx].set_xlim(-2, 2)
    axes[idx].set_ylim(-2, 2)
    axes[idx].set_aspect('equal')
    axes[idx].grid(alpha=0.3)

plt.suptitle('Penalty Method: min(x+y) on unit circle', fontsize=12)
plt.tight_layout()
plt.show()

x_exact = -1/np.sqrt(2)
print(f'Exact solution: ({x_exact:.6f}, {x_exact:.6f})')
print(f'As mu increases, the penalty solution approaches the exact answer.')
_images/d012439daf47d256ab82f079888ebd42a61277548ce129b219c892475b3fa4e3.png
Exact solution: (-0.707107, -0.707107)
As mu increases, the penalty solution approaches the exact answer.
# scipy with explicit constraint
constraint = {'type': 'eq', 'fun': g_constraint}

res = sp_minimize(f_obj, [0.5, 0.8], method='SLSQP', constraints=constraint)
print('scipy SLSQP (constrained optimization):')
print(f'  Solution: ({res.x[0]:.8f}, {res.x[1]:.8f})')
print(f'  f_min = {res.fun:.8f}')
print(f'  Constraint satisfied: g = {g_constraint(res.x):.2e}')
print(f'  Exact: ({-1/np.sqrt(2):.8f}, {-1/np.sqrt(2):.8f})')
scipy SLSQP (constrained optimization):
  Solution: (-0.70710678, -0.70710678)
  f_min = -1.41421356
  Constraint satisfied: g = 5.06e-10
  Exact: (-0.70710678, -0.70710678)

VI. Summary#

Method Comparison#

Method

Convergence

Cost per step

Needs

Best for

Gradient descent

Linear

\(O(n)\)

Gradient

Simple problems, large \(n\)

Conjugate gradient

Superlinear

\(O(n)\)

Gradient + line search

Medium \(n\), quadratic-ish

Newton

Quadratic

\(O(n^3)\)

Gradient + Hessian

Small \(n\), exact Hessian available

BFGS

Superlinear

\(O(n^2)\)

Gradient only

General purpose, moderate \(n\)

Key Takeaways#

  1. Curvature information dramatically accelerates convergence — Newton converges quadratically vs gradient descent’s linear convergence

  2. BFGS is the practical workhorse — it approximates curvature from gradient changes, no Hessian needed

  3. Conjugate gradient avoids zigzagging by choosing \(\mathbf{H}\)-orthogonal search directions

  4. Constrained optimization can use penalty methods (simple) or dedicated algorithms (SLSQP, trust-constr)

  5. All local methods find the nearest minimum — for global optimization, combine with simulated annealing (Lecture 14)

When to Use What (Practical Guide)#

Small problem (n < 100):     scipy BFGS or Newton-CG
Medium problem (n ~ 1000):   scipy L-BFGS-B (memory-limited BFGS)
Large problem (n > 10000):   scipy CG or custom gradient descent
No derivatives available:    scipy Nelder-Mead or Powell
With constraints:            scipy SLSQP or trust-constr