Skip to main content
Ctrl+K

Computational Physics Lecture Notes

  • Computational Physics, Rutgers-Newark
  • Python Basics I
  • Python Basics II
  • Python Basics III
  • Numerical Integration
  • Numerical Differentiation
  • Interpolation
  • Curve Fitting and Regression
  • Fourier Transform
  • Fast Fourier Transform
  • Ordinary Differential Equations
  • Partial Differential Equations
  • Stochastic Methods I
  • Stochastic Methods II
  • Monte Carlo Methods
  • Optimization I
  • Optimization II
  • Global Optimization I
  • Global Optimization II
  • Global Optimization III
  • Machine Learning Foundations
  • Neural Networks
  • Deep Learning for Physics
  • Graph Neural Networks
  • .ipynb

Lecture 15: Optimization I — Gradient Descent and Line Search

Contents

  • Finding Minima: The Core Problem of Computational Science
  • I. The Optimization Landscape
    • Local vs Global Minima
  • II. 1D Optimization: Gradient Descent
    • The Gradient Descent Algorithm
    • Why Step Size Matters
    • Test Function: A Cubic Polynomial
    • Implementation: Fixed Step Size
    • Sensitivity to Starting Point
    • Improvement: Variable Step Size (Barzilai-Borwein Method)
  • III. Golden Section Search: No Derivatives Needed
    • A Different Approach: Bracket-Based Search
    • The Golden Section Search
    • Why the Golden Ratio?
    • Golden Section vs Gradient Descent
  • IV. Extending to 2D: Gradient Descent on Surfaces
    • 2D Test Functions
    • 2D Numerical Gradient
    • Method 1: Fixed Step Size in 2D
    • Method 2: Variable Step Size (Barzilai-Borwein) in 2D
    • Comparing Both Methods in 2D
  • V. Using scipy.optimize
    • Professional Optimization with scipy.optimize.minimize
    • Summary: What We Learned Today

Lecture 15: Optimization I — Gradient Descent and Line Search#

Computational Physics — Spring 2026

Finding Minima: The Core Problem of Computational Science#

Optimization is one of the most fundamental tasks in computational physics. Nearly every simulation, fit, or model ultimately requires finding the minimum of some function:

  • Molecular physics: Finding stable configurations (minimize potential energy)

  • Data fitting: Minimizing the residual between model and data (Lecture 7, HW4)

  • Quantum mechanics: Variational principle — the ground state minimizes \(\langle \hat{H} \rangle\) (Lecture 13 VMC)

  • Machine learning: Training = minimizing a loss function

In Lecture 14, we used simulated annealing — a global, stochastic optimizer. Today we study local optimization: deterministic methods that follow the landscape downhill to the nearest minimum.

Today’s roadmap:

  • The optimization landscape: local vs global minima

  • 1D gradient descent: fixed step and adaptive step

  • Golden section search: derivative-free 1D optimization

  • Extending gradient descent to 2D

  • Introduction to scipy.optimize

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [6, 4]
plt.rcParams['font.size'] = 9

print("Ready for optimization!")
Ready for optimization!

I. The Optimization Landscape#

Local vs Global Minima#

Given a function \(f(x)\), we want to find \(x^*\) such that \(f(x^*) \leq f(x)\) for all \(x\) nearby.

Key concepts:

  • Local minimum: \(f(x^*) \leq f(x)\) for all \(x\) in a neighborhood of \(x^*\)

  • Global minimum: \(f(x^*) \leq f(x)\) for all \(x\)

  • Saddle point: A point where the gradient is zero but it’s neither a minimum nor maximum

  • Convex function: A function where every local minimum is also the global minimum

Necessary condition for a minimum: \(f'(x^*) = 0\) and \(f''(x^*) > 0\).

The challenge: Local optimization methods can only find local minima. The result depends on where we start. This is why we ran simulated annealing in Lecture 14 — it can escape local minima via thermal fluctuations.

# Visualize: a landscape with multiple minima
x = np.linspace(-2, 3, 300)
f_landscape = lambda x: x**4 - 3*x**3 + 2*x**2 + x - 1

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: function with local and global minima
axes[0].plot(x, f_landscape(x), 'b-', lw=2)
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_title('Multiple Minima: Where You Start Matters')
axes[0].grid(alpha=0.3)

# Mark local and global minima (approximate)
from scipy.optimize import minimize_scalar
res1 = minimize_scalar(f_landscape, bounds=(-1, 0.5), method='bounded')
res2 = minimize_scalar(f_landscape, bounds=(1.5, 3), method='bounded')
axes[0].plot(res1.x, res1.fun, 'g*', ms=15, label=f'Global min ({res1.x:.2f}, {res1.fun:.2f})')
axes[0].plot(res2.x, res2.fun, 'ro', ms=10, label=f'Local min ({res2.x:.2f}, {res2.fun:.2f})')
axes[0].legend(fontsize=8)

# Right: convex function — every local min is global
f_convex = lambda x: (x - 1)**2 + 0.5
axes[1].plot(x, f_convex(x), 'b-', lw=2)
axes[1].plot(1, 0.5, 'r*', ms=15, label='Unique global min')
axes[1].set_xlabel('x')
axes[1].set_ylabel('f(x)')
axes[1].set_title('Convex Function: Local = Global')
axes[1].legend(fontsize=8)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("For convex functions, any local minimum is the global minimum.")
print("For non-convex functions, the result depends on the starting point!")
_images/5c1206c77589ca6dc617ed1de5d0fc48a3e9674dd720539f82b6f167a17eb0c5.png
For convex functions, any local minimum is the global minimum.
For non-convex functions, the result depends on the starting point!

II. 1D Optimization: Gradient Descent#

The Gradient Descent Algorithm#

The simplest optimization method: follow the slope downhill.

Update rule:

\[x_{n+1} = x_n - \eta \, f'(x_n)\]

where \(\eta\) is the step size (or learning rate).

  • If \(f'(x_n) > 0\): the function is increasing, so we move left (\(x\) decreases)

  • If \(f'(x_n) < 0\): the function is decreasing, so we move right (\(x\) increases)

  • If \(f'(x_n) = 0\): we’re at a critical point (minimum, maximum, or saddle)

We approximate the derivative numerically using the central difference formula:

\[f'(x) \approx \frac{f(x + \delta) - f(x - \delta)}{2\delta}\]

Why Step Size Matters#

  • \(\eta\) too large: overshoot the minimum, oscillate or diverge

  • \(\eta\) too small: converge very slowly, waste computation

  • Just right: efficient convergence

Test Function: A Cubic Polynomial#

\[f(x) = Ax^3 + Bx^2 + C\]

with \(A = 1.2\), \(B = 3.0\), \(C = 0.6\). This function has a local minimum near \(x = 0\) and goes to \(-\infty\) as \(x \to -\infty\).

# Define coefficients for our polynomial
A, B, C = 1.2, 3.0, 0.6

x_min, x_max = -1, 1
f = lambda x: A*x**3 + B*x**2 + C

# Visualize
x = np.linspace(x_min, x_max, 101)
plt.figure(figsize=(8, 4))
plt.plot(x, f(x), 'b-', lw=2, label='$f(x) = 1.2x^3 + 3x^2 + 0.6$')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Our Test Function')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Analytical minimum: f'(x) = 3Ax^2 + 2Bx = 0 => x = 0 or x = -2B/(3A)
x_crit = -2*B / (3*A)
print(f"Critical points: x = 0 and x = {x_crit:.4f}")
print(f"f(0) = {f(0):.4f}, f({x_crit:.4f}) = {f(x_crit):.4f}")
_images/a8ea9d59194e909d134219641246e0d0f1c05fa72a4f30b73471467ce9b1a2a7.png
Critical points: x = 0 and x = -1.6667
f(0) = 0.6000, f(-1.6667) = 3.3778

Implementation: Fixed Step Size#

def derivative(f, x, dx=0.01):
    """Numerical derivative using central difference."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)

def minimize(f, x0, dx=0.01, N=1000):
    """Gradient descent with fixed step size.

    Args:
        f: function to minimize
        x0: starting point
        dx: step size (learning rate)
        N: max iterations

    Returns:
        converged, x_best, f_min, x_hist, step
    """
    x_now = x0
    converged = False
    x_hist = [x0]

    for i in range(N):
        x_next = x_now - dx * derivative(f, x_now)
        if f(x_next) < f(x_now):
            x_now = x_next
            x_hist.append(x_now)
        else:
            converged = True
            break

    step = i
    return converged, x_now, f(x_now), np.array(x_hist), step
# Test with a specific starting point
x0 = -0.83
converged, x_best, f_min, x_hist, step = minimize(f, x0, dx=0.1)

print(f'Starting point: x0 = {x0}')
print(f'Converged: {converged}')
print(f'Best solution: x = {x_best:.4f}, f = {f_min:.4f}')
print(f'Steps taken: {step}')

# Visualize the optimization path
x = np.linspace(x_min, x_max, 101)
plt.figure(figsize=(8, 4))
plt.plot(x, f(x), 'b-', lw=2, label='f(x)')
plt.plot(x_hist, f(x_hist), 'ro-', ms=5, label='Optimization path')
plt.plot(x0, f(x0), 'go', ms=10, label=f'Start (x={x0})')
plt.plot(x_best, f_min, 'k*', ms=15, label=f'Solution (x={x_best:.3f})')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gradient Descent: Fixed Step Size')
plt.legend(fontsize=8)
plt.grid(alpha=0.3)
plt.show()
Starting point: x0 = -0.83
Converged: True
Best solution: x = -0.0000, f = 0.6000
Steps taken: 33
_images/ad6fc1e33a4d759af03a8f1f25a2954a5835bde7edc88b49ce50d30610a73504.png

Sensitivity to Starting Point#

Let’s run many trials with random starting points to see how the result depends on where we begin.

np.random.seed(42)

print(f'{"x0":>10s} | {"Converged":>9s} | {"x_best":>8s} | {"f(x_best)":>10s} | {"Steps":>5s}')
print('-' * 55)

for i in range(20):
    x1 = x_min + np.random.random() * (x_max - x_min)
    converged, x_best, f_min, x_hist, step = minimize(f, x1, dx=0.02)
    print(f'{x1:10.4f} | {str(converged):>9s} | {x_best:8.4f} | {f_min:10.4f} | {step:5d}')
        x0 | Converged |   x_best |  f(x_best) | Steps
-------------------------------------------------------
   -0.2509 |      True |  -0.0000 |     0.6000 |   193
    0.9014 |      True |  -0.0000 |     0.6000 |    80
    0.4640 |      True |   0.0000 |     0.6000 |    76
    0.1973 |      True |  -0.0000 |     0.6000 |    71
   -0.6880 |      True |  -0.0000 |     0.6000 |   204
   -0.6880 |      True |  -0.0000 |     0.6000 |   204
   -0.8838 |      True |  -0.0000 |     0.6000 |   208
    0.7324 |      True |  -0.0000 |     0.6000 |    79
    0.2022 |      True |   0.0000 |     0.6000 |    71
    0.4161 |      True |  -0.0000 |     0.6000 |    76
   -0.9588 |      True |  -0.0000 |     0.6000 |   211
    0.9398 |      True |   0.0000 |     0.6000 |    80
    0.6649 |      True |   0.0000 |     0.6000 |    78
   -0.5753 |      True |  -0.0000 |     0.6000 |   203
   -0.6364 |      True |  -0.0000 |     0.6000 |   203
   -0.6332 |      True |  -0.0000 |     0.6000 |   203
   -0.3915 |      True |  -0.0000 |     0.6000 |   199
    0.0495 |      True |  -0.0000 |     0.6000 |    61
   -0.1361 |      True |  -0.0000 |     0.6000 |   189
   -0.4175 |      True |  -0.0000 |     0.6000 |   198

Improvement: Variable Step Size (Barzilai-Borwein Method)#

Instead of a fixed \(\eta\), we can adapt the step size at each iteration. The Barzilai-Borwein method uses information from the previous step:

\[\eta_n = \frac{x_n - x_{n-1}}{f'(x_n) - f'(x_{n-1})}\]

This is derived from a secant approximation to the Hessian. The idea: if the gradient hasn’t changed much, take a larger step; if it changed a lot, take a smaller step.

def minimize2(f, x0, N=1000):
    """Gradient descent with Barzilai-Borwein variable step size."""
    x_now = x0
    x_prev = None
    converged = False
    x_hist = [x0]

    dx = 0.1  # initial step size
    for i in range(N):
        if x_prev is not None:
            dfx = derivative(f, x_now) - derivative(f, x_prev)
            if abs(dfx) > 1e-8:
                dx = (x_now - x_prev) / dfx
            else:
                dx = 0.1

        x_next = x_now - derivative(f, x_now) * dx
        if f(x_next) < f(x_now):
            x_prev = x_now
            x_now = x_next
            x_hist.append(x_now)
        else:
            converged = True
            break

    return converged, x_now, f(x_now), np.array(x_hist), i

# Compare with fixed step
np.random.seed(42)
print(f'{"x0":>10s} | {"Method":>10s} | {"Converged":>9s} | {"x_best":>8s} | {"f(x_best)":>10s} | {"Steps":>5s}')
print('-' * 70)

for i in range(10):
    x1 = x_min + np.random.random() * (x_max - x_min)
    c1, xb1, fm1, _, s1 = minimize(f, x1, dx=0.05)
    c2, xb2, fm2, _, s2 = minimize2(f, x1)
    print(f'{x1:10.4f} | {"Fixed":>10s} | {str(c1):>9s} | {xb1:8.4f} | {fm1:10.4f} | {s1:5d}')
    print(f'{"":>10s} | {"Variable":>10s} | {str(c2):>9s} | {xb2:8.4f} | {fm2:10.4f} | {s2:5d}')
    print()
        x0 |     Method | Converged |   x_best |  f(x_best) | Steps
----------------------------------------------------------------------
   -0.2509 |      Fixed |      True |  -0.0000 |     0.6000 |    72
           |   Variable |      True |  -0.0000 |     0.6000 |     5

    0.9014 |      Fixed |      True |   0.0000 |     0.6000 |    28
           |   Variable |      True |  -0.0000 |     0.6000 |     4

    0.4640 |      Fixed |      True |   0.0000 |     0.6000 |    27
           |   Variable |      True |  -0.0000 |     0.6000 |     4

    0.1973 |      Fixed |      True |   0.0000 |     0.6000 |    25
           |   Variable |      True |  -0.0000 |     0.6000 |     4

   -0.6880 |      Fixed |      True |  -0.0000 |     0.6000 |    78
           |   Variable |      True |  -0.4456 |     1.0895 |     1

   -0.6880 |      Fixed |      True |  -0.0000 |     0.6000 |    78
           |   Variable |      True |  -0.4456 |     1.0896 |     1

   -0.8838 |      Fixed |      True |  -0.0000 |     0.6000 |    78
           |   Variable |      True |  -0.6348 |     1.5019 |     1

    0.7324 |      Fixed |      True |  -0.0000 |     0.6000 |    28
           |   Variable |      True |   0.0000 |     0.6000 |     4

    0.2022 |      Fixed |      True |   0.0000 |     0.6000 |    25
           |   Variable |      True |  -0.0000 |     0.6000 |     4

    0.4161 |      Fixed |      True |  -0.0000 |     0.6000 |    27
           |   Variable |      True |  -0.0000 |     0.6000 |     4

III. Golden Section Search: No Derivatives Needed#

A Different Approach: Bracket-Based Search#

Gradient descent requires computing \(f'(x)\). What if the derivative is expensive or doesn’t exist? We can use a bracketing method instead.

Idea: Maintain a bracket \([a, b]\) that contains the minimum, and shrink it at each step.

The Golden Section Search#

Analogous to the bisection method for root-finding, but for minimization.

Algorithm:

  1. Start with bracket \([a, b]\) containing a minimum

  2. Evaluate \(f\) at two interior points: $\(x_1 = a + (1 - \varphi)(b - a), \quad x_2 = a + \varphi(b - a)\)\( where \)\varphi = \frac{\sqrt{5} - 1}{2} \approx 0.618$ is the golden ratio conjugate

  3. If \(f(x_1) < f(x_2)\): the minimum is in \([a, x_2]\) — set \(b = x_2\)

  4. If \(f(x_1) \geq f(x_2)\): the minimum is in \([x_1, b]\) — set \(a = x_1\)

  5. Repeat until \(|b - a| < \text{tolerance}\)

Why the Golden Ratio?#

The key insight: with the golden ratio, one of the two evaluation points from the previous step can be reused in the next step. This means we only need one new function evaluation per step (instead of two), cutting the work in half.

Proof: After narrowing \([a, b]\) to \([a, x_2]\), the old point \(x_1\) satisfies:

\[\frac{x_1 - a}{x_2 - a} = \frac{(1-\varphi)(b-a)}{\varphi(b-a)} = \frac{1-\varphi}{\varphi} = \varphi\]

So \(x_1\) is at position \(\varphi\) within \([a, x_2]\) — exactly where the new upper evaluation point should be! This works because \(\varphi\) satisfies \(\varphi^2 + \varphi = 1\), the defining property of the golden ratio. \(\square\)

Convergence: The bracket shrinks by factor \(\varphi \approx 0.618\) at each step, giving linear convergence.

def golden_section_search(f, a, b, tol=1e-6, N=100):
    """Golden section search for minimum of f on [a, b].

    Requires: f has a single minimum in [a, b] (unimodal).
    """
    phi = (np.sqrt(5) - 1) / 2  # golden ratio conjugate ≈ 0.618

    # Initial interior points
    x1 = a + (1 - phi) * (b - a)
    x2 = a + phi * (b - a)
    f1 = f(x1)
    f2 = f(x2)

    history = [(a, b, x1, x2)]

    for i in range(N):
        if abs(b - a) < tol:
            break

        if f1 < f2:
            # Minimum is in [a, x2]
            b = x2
            x2 = x1            # reuse old x1 as new x2
            f2 = f1
            x1 = a + (1 - phi) * (b - a)  # only one new evaluation
            f1 = f(x1)
        else:
            # Minimum is in [x1, b]
            a = x1
            x1 = x2            # reuse old x2 as new x1
            f1 = f2
            x2 = a + phi * (b - a)  # only one new evaluation
            f2 = f(x2)

        history.append((a, b, x1, x2))

    x_min = (a + b) / 2
    return x_min, f(x_min), i, history

# Test on our polynomial
x_opt, f_opt, steps, hist = golden_section_search(f, -0.5, 0.5)

print(f"Golden section search on f(x) = {A}x³ + {B}x² + {C}")
print(f"Bracket: [{-0.5}, {0.5}]")
print(f"Found minimum: x = {x_opt:.6f}, f = {f_opt:.6f}")
print(f"Steps: {steps}")
print(f"\nAnalytical minimum: x = 0, f = {C}")
Golden section search on f(x) = 1.2x³ + 3.0x² + 0.6
Bracket: [-0.5, 0.5]
Found minimum: x = 0.000000, f = 0.600000
Steps: 29

Analytical minimum: x = 0, f = 0.6
# Visualize the bracket narrowing
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

x_plot = np.linspace(-0.6, 0.6, 200)
axes[0].plot(x_plot, f(x_plot), 'b-', lw=2)

# Show first 6 brackets
colors = plt.cm.viridis(np.linspace(0, 1, min(6, len(hist))))
for k in range(min(6, len(hist))):
    a, b, x1, x2 = hist[k]
    axes[0].axvspan(a, b, alpha=0.1, color=colors[k])
    axes[0].plot([x1, x2], [f(x1), f(x2)], 'o', color=colors[k], ms=4)

axes[0].plot(x_opt, f_opt, 'r*', ms=15, label=f'Minimum x={x_opt:.4f}')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_title('Golden Section Search: Bracket Narrowing')
axes[0].legend(fontsize=8)
axes[0].grid(alpha=0.3)

# Right: bracket width vs iteration
widths = [h[1] - h[0] for h in hist]
axes[1].semilogy(range(len(widths)), widths, 'ro-', ms=5)
axes[1].plot(range(len(widths)), widths[0] * 0.618**np.arange(len(widths)), 'k--',
             label='$\\varphi^n$ (theory)', alpha=0.6)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Bracket width |b - a|')
axes[1].set_title('Convergence Rate')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Bracket shrinks by factor φ ≈ 0.618 each step")
print(f"After {steps} steps: bracket width = {widths[-1]:.2e}")
print(f"Only 1 function evaluation per step (not 2)!")
_images/279f4b65535f3d57820f3bbe178292a9fea0a946ce31dffbcd9881c304ec7947.png
Bracket shrinks by factor φ ≈ 0.618 each step
After 29 steps: bracket width = 8.70e-07
Only 1 function evaluation per step (not 2)!

Golden Section vs Gradient Descent#

Feature

Gradient Descent

Golden Section

Requires derivative?

Yes

No

Requires bracket?

No

Yes (must contain minimum)

Convergence

Depends on step size

Guaranteed linear (factor 0.618)

Works in N-D?

Yes

Only 1D

Multiple minima?

Finds nearest

Finds the one in the bracket

IV. Extending to 2D: Gradient Descent on Surfaces#

2D Test Functions#

Now we extend gradient descent to functions of two variables. The gradient replaces the derivative:

\[\mathbf{x}_{n+1} = \mathbf{x}_n - \eta \, \nabla f(\mathbf{x}_n)\]

where \(\mathbf{x} = (x, y)\) and \(\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)\).

We’ll use two test functions:

  • \(f_1\): Single minimum (convex) $\(f_1(x, y) = \frac{x^2}{2} + \frac{y^2}{3} - \frac{xy}{4}\)$

  • \(f_2\): Two minima (non-convex) $\(f_2(x, y) = \frac{x^2}{2} + \frac{y^2}{3} - \frac{xy}{4} + 3 e^{-x^2}\)$

The Gaussian bump in \(f_2\) creates a second local minimum.

def f1(x):
    """Simple function with 1 minimum."""
    return x[0]**2 / 2 + x[1]**2 / 3 - x[0] * x[1] / 4

def f2(x):
    """Function with 2 minima (Gaussian bump)."""
    return x[0]**2 / 2 + x[1]**2 / 3 - x[0] * x[1] / 4 + 3 * np.exp(-x[0]**2)

# Visualize both functions
x_min, x_max = -4, 4
y_min, y_max = -4, 4
nx = np.linspace(x_min, x_max, 400)
ny = np.linspace(y_min, y_max, 400)
x, y = np.meshgrid(nx, ny)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

z1 = f1([x, y])
levels1 = np.arange(np.min(z1), np.max(z1), 0.3)
axes[0].contour(x, y, z1, levels=levels1, cmap='viridis')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('$f_1$: Single Minimum (Convex)')
axes[0].set_aspect('equal')
axes[0].grid(alpha=0.3)

z2 = f2([x, y])
levels2 = np.arange(np.min(z2), np.max(z2), 0.3)
axes[1].contour(x, y, z2, levels=levels2, cmap='viridis')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].set_title('$f_2$: Two Minima (Non-Convex)')
axes[1].set_aspect('equal')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
_images/e731dc0bc6dfeaf4da0d0e3a8cec1972aaf30c9d085ae1118033757d8e665afd.png

2D Numerical Gradient#

We compute the gradient using central differences in each direction:

\[\frac{\partial f}{\partial x} \approx \frac{f(x + \delta/2, y) - f(x - \delta/2, y)}{\delta}\]
\[\frac{\partial f}{\partial y} \approx \frac{f(x, y + \delta/2) - f(x, y - \delta/2)}{\delta}\]
def derivative2(f, xy, d=0.001):
    """Numerical gradient of a 2D function using central differences."""
    x, y = xy[0], xy[1]
    fx = (f([x + d/2, y]) - f([x - d/2, y])) / d
    fy = (f([x, y + d/2]) - f([x, y - d/2])) / d
    return np.array([fx, fy])

def init(x_min, x_max, y_min, y_max):
    """Generate random starting point in [x_min, x_max] x [y_min, y_max]."""
    x0 = x_min + np.random.random() * (x_max - x_min)
    y0 = y_min + np.random.random() * (y_max - y_min)
    return [x0, y0]

# Test the gradient
test_point = [1.0, 2.0]
grad = derivative2(f1, test_point)
print(f"Gradient of f1 at (1, 2): [{grad[0]:.4f}, {grad[1]:.4f}]")
print(f"Analytical: [{1.0 - 2.0/4:.4f}, {2*2.0/3 - 1.0/4:.4f}]")
Gradient of f1 at (1, 2): [0.5000, 1.0833]
Analytical: [0.5000, 1.0833]

Method 1: Fixed Step Size in 2D#

def minimize_fix(f, x0, dx=0.05, N=1000):
    """2D gradient descent with fixed step size."""
    x_now = np.array(x0, dtype=float)
    converged = False
    x_hist = [x_now.copy()]

    for i in range(N):
        # code...
        x_next = x_now - dx * derivative2(f, x_now)

        if np.linalg.norm(x_next - x_now) < 1e-6:
            converged = True
            break
        else:
            x_now = x_next
            x_hist.append(x_now.copy())

    return converged, np.array(x_hist), f(x_now)

# Test on f1 (single minimum)
x0, y0 = 3.5, 3.5
converged, x_hist, f_min = minimize_fix(f1, [x0, y0], dx=0.1)

z1 = f1([x, y])
plt.figure(figsize=(6, 5))
plt.contour(x, y, z1, levels=np.arange(0, np.max(z1), 0.3), cmap='viridis')
plt.plot(x_hist[:, 0], x_hist[:, 1], 'ro-', ms=3, lw=1, label='Path')
plt.plot(x0, y0, 'go', ms=10, label='Start')
plt.plot(x_hist[-1, 0], x_hist[-1, 1], 'k*', ms=15, label='Solution')
plt.xlabel('x')
plt.ylabel('y')
plt.title(f'Fixed Step on $f_1$: {len(x_hist)} iterations')
plt.legend(fontsize=8)
plt.grid(alpha=0.3)
plt.show()

print(f'Converged: {converged}')
print(f'Start: ({x0}, {y0}), f = {f1([x0, y0]):.4f}')
print(f'End:   ({x_hist[-1, 0]:.4f}, {x_hist[-1, 1]:.4f}), f = {f_min:.4f}')
print(f'Iterations: {len(x_hist)}')
_images/36b27171b8c606cf8a55def06fbea3fbbffaf1c1e2106eec7caf6f6d3a0871c1.png
Converged: True
Start: (3.5, 3.5), f = 7.1458
End:   (0.0000, 0.0000), f = 0.0000
Iterations: 229
# Test on f2 (two minima) — starting point determines which minimum we find
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

z2 = f2([x, y])
levels2 = np.arange(np.min(z2), np.max(z2), 0.3)

# Start from right side
converged, x_hist_r, f_min_r = minimize_fix(f2, [3.5, 3.5])
axes[0].contour(x, y, z2, levels=levels2, cmap='viridis')
axes[0].plot(x_hist_r[:, 0], x_hist_r[:, 1], 'ro-', ms=3, lw=1)
axes[0].plot(3.5, 3.5, 'go', ms=10)
axes[0].plot(x_hist_r[-1, 0], x_hist_r[-1, 1], 'k*', ms=15)
axes[0].set_title(f'Start (3.5, 3.5) → ({x_hist_r[-1,0]:.2f}, {x_hist_r[-1,1]:.2f})')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].grid(alpha=0.3)

# Start from left side
converged, x_hist_l, f_min_l = minimize_fix(f2, [-3.5, 3.5])
axes[1].contour(x, y, z2, levels=levels2, cmap='viridis')
axes[1].plot(x_hist_l[:, 0], x_hist_l[:, 1], 'ro-', ms=3, lw=1)
axes[1].plot(-3.5, 3.5, 'go', ms=10)
axes[1].plot(x_hist_l[-1, 0], x_hist_l[-1, 1], 'k*', ms=15)
axes[1].set_title(f'Start (-3.5, 3.5) → ({x_hist_l[-1,0]:.2f}, {x_hist_l[-1,1]:.2f})')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].grid(alpha=0.3)

plt.suptitle('Fixed Step on $f_2$: Different Starts → Different Minima', fontsize=11)
plt.tight_layout()
plt.show()

print(f"Right start → f = {f_min_r:.4f} ({len(x_hist_r)} steps)")
print(f"Left start  → f = {f_min_l:.4f} ({len(x_hist_l)} steps)")
_images/c6608287112270885f7ca44e45e6b2bcf20b10d9ebc5b840fa4217baa4da7333.png
Right start → f = 1.3096 (355 steps)
Left start  → f = 1.3096 (358 steps)

Method 2: Variable Step Size (Barzilai-Borwein) in 2D#

The 2D generalization of the Barzilai-Borwein method:

\[\eta_n = \frac{(\mathbf{x}_n - \mathbf{x}_{n-1})^T [\nabla f(\mathbf{x}_n) - \nabla f(\mathbf{x}_{n-1})]}{\|\nabla f(\mathbf{x}_n) - \nabla f(\mathbf{x}_{n-1})\|^2}\]
def minimize_var(f, x0, N=1000):
    """2D gradient descent with Barzilai-Borwein variable step size."""
    x_now = np.array(x0, dtype=float)
    x_prev = None
    converged = False
    x_hist = [x_now.copy()]

    for i in range(N):
        df_now = derivative2(f, x_now)

        if x_prev is None:
            dx = 0.01
        else:
            df_prev = derivative2(f, x_prev)
            dd = df_now - df_prev
            denom = np.linalg.norm(dd)**2
            if denom > 1e-16:
                dx = np.dot(x_now - x_prev, dd) / denom
            else:
                dx = 0.01

        x_next = x_now - df_now * dx

        if f(x_next) < f(x_now):
            x_prev = x_now.copy()
            x_now = x_next
            x_hist.append(x_now.copy())
        else:
            converged = True
            break

    return converged, np.array(x_hist), f(x_now)

Comparing Both Methods in 2D#

# Compare both methods on f1 and f2
start_points = [[3.5, 3.5], [3.5, -3.5], [-3.5, 3.5], [-3.5, -3.5]]
methods = [
    ('Fixed Step', minimize_fix),
    ('Variable Step', minimize_var),
]

fig, axes = plt.subplots(2, 2, figsize=(11, 10))

for col, (name, method) in enumerate(methods):
    # f1 (top row)
    axes[0, col].contour(x, y, f1([x, y]),
                         levels=np.arange(0, 12, 0.3), cmap='viridis', alpha=0.6)
    for sp in start_points:
        conv, xh, fm = method(f1, sp)
        axes[0, col].plot(xh[:, 0], xh[:, 1], 'o-', ms=2, lw=1)
        axes[0, col].plot(sp[0], sp[1], 'go', ms=6)
    axes[0, col].set_title(f'{name} on $f_1$')
    axes[0, col].set_xlabel('x')
    axes[0, col].set_ylabel('y')
    axes[0, col].grid(alpha=0.3)

    # f2 (bottom row)
    axes[1, col].contour(x, y, f2([x, y]),
                         levels=np.arange(np.min(z2), 12, 0.3), cmap='viridis', alpha=0.6)
    for sp in start_points:
        conv, xh, fm = method(f2, sp)
        axes[1, col].plot(xh[:, 0], xh[:, 1], 'o-', ms=2, lw=1)
        axes[1, col].plot(sp[0], sp[1], 'go', ms=6)
    axes[1, col].set_title(f'{name} on $f_2$')
    axes[1, col].set_xlabel('x')
    axes[1, col].set_ylabel('y')
    axes[1, col].grid(alpha=0.3)

plt.suptitle('Comparing Gradient Descent Methods: 4 Starting Points', fontsize=12)
plt.tight_layout()
plt.show()
_images/7e311a4358e58f9bfdf105db38603dcf29e4e5015fe4fc3dc0f96d5d81c0e379.png
# Quantitative comparison on f2
print("Comparison on f2 (two minima)")
print(f'{"Start":>14s} | {"Method":>12s} | {"Conv":>5s} | {"Final (x, y)":>20s} | {"f_min":>8s} | {"Steps":>5s}')
print('-' * 80)

for sp in start_points:
    for name, method in methods:
        conv, xh, fm = method(f2, sp)
        end = xh[-1]
        print(f'({sp[0]:5.1f},{sp[1]:5.1f}) | {name:>12s} | {str(conv):>5s} | ({end[0]:8.4f}, {end[1]:8.4f}) | {fm:8.4f} | {len(xh):5d}')
    print()
Comparison on f2 (two minima)
         Start |       Method |  Conv |         Final (x, y) |    f_min | Steps
--------------------------------------------------------------------------------
(  3.5,  3.5) |   Fixed Step |  True | (  1.3748,   0.5156) |   1.3096 |   355
(  3.5,  3.5) | Variable Step |  True | (  1.3762,   0.5162) |   1.3096 |     9

(  3.5, -3.5) |   Fixed Step |  True | (  1.3748,   0.5155) |   1.3096 |   358
(  3.5, -3.5) | Variable Step |  True | ( -0.3739,  -0.6540) |   2.7599 |     3

( -3.5,  3.5) |   Fixed Step |  True | ( -1.3748,  -0.5155) |   1.3096 |   358
( -3.5,  3.5) | Variable Step |  True | (  0.3739,   0.6540) |   2.7599 |     3

( -3.5, -3.5) |   Fixed Step |  True | ( -1.3748,  -0.5156) |   1.3096 |   355
( -3.5, -3.5) | Variable Step |  True | ( -1.3762,  -0.5162) |   1.3096 |     9

V. Using scipy.optimize#

Professional Optimization with scipy.optimize.minimize#

In practice, we use well-tested library implementations. scipy.optimize.minimize provides many algorithms through a unified interface:

Method

Type

Requires

Best for

'Nelder-Mead'

Simplex (gradient-free)

Only \(f\)

Noisy or non-smooth functions

'Powell'

Direction set (gradient-free)

Only \(f\)

Low-dimensional smooth functions

'CG'

Conjugate gradient

\(f\), \(\nabla f\)

Large-scale smooth problems

'BFGS'

Quasi-Newton

\(f\), \(\nabla f\)

General smooth optimization

'Newton-CG'

Newton + CG

\(f\), \(\nabla f\), \(H\)

When Hessian is available

We’ll study Newton, BFGS, and CG in detail in the next lecture. For now, let’s see them in action.

from scipy.optimize import minimize as sp_minimize

x0 = [3.5, 3.5]

print(f"Minimizing f2 from starting point {x0}")
print("=" * 60)

for method in ['Nelder-Mead', 'Powell', 'CG', 'BFGS']:
    res = sp_minimize(f2, x0, method=method, tol=1e-6)
    print(f"\n{method}:")
    print(f"  Solution: ({res.x[0]:.6f}, {res.x[1]:.6f})")
    print(f"  f_min:    {res.fun:.6f}")
    print(f"  Evaluations: {res.nfev}")
    print(f"  Converged: {res.success}")
Minimizing f2 from starting point [3.5, 3.5]
============================================================

Nelder-Mead:
  Solution: (1.374845, 0.515567)
  f_min:    1.309622
  Evaluations: 114
  Converged: True

Powell:
  Solution: (1.374845, 0.515567)
  f_min:    1.309622
  Evaluations: 126
  Converged: True

CG:
  Solution: (-1.374845, -0.515566)
  f_min:    1.309622
  Evaluations: 60
  Converged: True

BFGS:
  Solution: (1.374845, 0.515567)
  f_min:    1.309622
  Evaluations: 27
  Converged: True
# Visual comparison: our methods vs scipy
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

z2 = f2([x, y])
levels2 = np.arange(np.min(z2), np.max(z2), 0.3)

# Left: our variable step method
conv, xh_ours, fm = minimize_var(f2, [3.5, -3.5])
axes[0].contour(x, y, z2, levels=levels2, cmap='viridis', alpha=0.6)
axes[0].plot(xh_ours[:, 0], xh_ours[:, 1], 'ro-', ms=3, lw=1)
axes[0].set_title(f'Our Variable Step ({len(xh_ours)} steps)')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].grid(alpha=0.3)

# Right: scipy BFGS (collect trajectory using callback)
trajectory = [[3.5, -3.5]]
def callback(xk):
    trajectory.append(xk.tolist())

res = sp_minimize(f2, [3.5, 3.5], method='BFGS', callback=callback, tol=1e-6)
traj = np.array(trajectory)

axes[1].contour(x, y, z2, levels=levels2, cmap='viridis', alpha=0.6)
axes[1].plot(traj[:, 0], traj[:, 1], 'ro-', ms=3, lw=1)
axes[1].set_title(f'scipy BFGS ({len(traj)} steps)')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].grid(alpha=0.3)

plt.suptitle('Our Method vs scipy on $f_2$', fontsize=11)
plt.tight_layout()
plt.show()

print(f"Our method: {len(xh_ours)} steps, f = {fm:.6f}")
print(f"scipy BFGS: {len(traj)} steps, f = {res.fun:.6f}")
print(f"\nBFGS is much more efficient — we'll learn why in the next lecture!")
_images/ecb5b0efc4c566172ea90ddd4ce94d67de2fe84f441d96ec3982bd1fc96bab84.png
Our method: 3 steps, f = 2.759868
scipy BFGS: 9 steps, f = 1.309622

BFGS is much more efficient — we'll learn why in the next lecture!

Summary: What We Learned Today#

Method

Derivatives?

Dimension

Key Idea

Gradient descent (fixed \(\eta\))

Yes

Any

Simple but sensitive to \(\eta\)

Barzilai-Borwein (variable \(\eta\))

Yes

Any

Adapts \(\eta\) from previous step

Golden section search

No

1D only

Bracket narrowing via golden ratio

Nelder-Mead (scipy)

No

Any

Simplex exploration

BFGS (scipy)

Yes

Any

Quasi-Newton (next lecture!)

previous

Lecture 14: Monte Carlo Methods II — MCMC & Optimization

next

Lecture 16: Optimization II — Newton’s Method, BFGS, and Conjugate Gradient

Contents
  • Finding Minima: The Core Problem of Computational Science
  • I. The Optimization Landscape
    • Local vs Global Minima
  • II. 1D Optimization: Gradient Descent
    • The Gradient Descent Algorithm
    • Why Step Size Matters
    • Test Function: A Cubic Polynomial
    • Implementation: Fixed Step Size
    • Sensitivity to Starting Point
    • Improvement: Variable Step Size (Barzilai-Borwein Method)
  • III. Golden Section Search: No Derivatives Needed
    • A Different Approach: Bracket-Based Search
    • The Golden Section Search
    • Why the Golden Ratio?
    • Golden Section vs Gradient Descent
  • IV. Extending to 2D: Gradient Descent on Surfaces
    • 2D Test Functions
    • 2D Numerical Gradient
    • Method 1: Fixed Step Size in 2D
    • Method 2: Variable Step Size (Barzilai-Borwein) in 2D
    • Comparing Both Methods in 2D
  • V. Using scipy.optimize
    • Professional Optimization with scipy.optimize.minimize
    • Summary: What We Learned Today

By Li Zhu

© Copyright 2026.