MATH 697 Analysis and Machine Learning

In this course we will cover some of the fundamental ideas from analysis, statistics, and optimization that are relevant to methods in machine learning and statistical inference. The class will cover not only the best-known linear methods, but also more recently developed nonlinear methods that draw on intuition from classical topics in PDE and the calculus of variations, such as the theory of minimal surfaces, optimal transport, and gradient flows.

Minimal requirements: Undergraduate real analysis (basics of metric spaces, integration), basic probability (distributions, random variables), strong background in calculus and linear algebra.

Ideal requirements: Familiarity with one or more of the following topics: measure theory, differentiable manifolds, linear programming.

Textbook: The Elements of Statistical Learning (Springer Series in Statistics), 2nd edition. Authors: Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

Lecture Time: Tue-Thur 8:30-9:45 am in Lederle Grad. Res. 1234
Office Hours: Mon 3 pm-5 pm. Tue 10 am-11am.

Evaluation: Problem sets (60%). Project (40%).
Solutions to problem sets are to be submitted in LaTeX format, via email. No late problem sets will be accepted. Of the 8 problem sets, the lowest two grades will be dropped in computing the total.
The project will be done in teams of two people (if needed, a couple of teams of 3 people will be allowed).

Updated! (November 10th) Important dates for the project: Tuesday of the 11th week. Thursday of the 15th week.

Problem sets

First   Second   Third  Fourth  Fifth

Lecture plan and slides (updated 10/23)

• Week 1.  (Slides)
Tuesday class: Review of prerequisites, basic facts, and notation. Function approximation/estimation as motivation. Four terms: supervised learning, unsupervised learning, regression, and classification. A glance at the latter half of the course: the calculus of variations, partial differential equations, and optimization.
Reading: (From HTF) Introduction, Sections 2.1 and 2.2.

Thursday class: Least squares versus $k$-nearest neighbor. Statistical decision theory.
Reading: (From HTF) Sections 2.3 and 2.4.
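As a concrete companion to this lecture, here is a minimal sketch of the two methods side by side, on synthetic two-class data (the data and function names are illustrative, not HTF's mixture example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data in R^2: class 0 centered at (0,0), class 1 at (2,2).
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([2.0, 2.0], 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Least squares on the 0/1 labels, with an intercept column;
# classify by thresholding the fitted value at 1/2.
A = np.hstack([np.ones((100, 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

def ls_predict(x):
    return int(np.hstack([1.0, x]) @ beta > 0.5)

# k-nearest neighbors: majority vote among the k closest training points.
def knn_predict(x, k=15):
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return int(y[idx].mean() > 0.5)
```

The contrast HTF draws is visible in the code: least squares commits to one global linear boundary (low variance, possibly high bias), while $k$-NN makes a purely local decision (low bias, high variance).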

• Week 2.  (Slides)
Tuesday class: Curse of dimensionality and how it affects $k$-NN and least squares. Mean squared error and expected prediction error. Statistical models and restricted function classes.
Reading: (From HTF) Sections 2.6 and 2.7.

Thursday class: Maximum likelihood. Least squares and ML for additive models. More about restricted regression models: roughness penalty, kernel methods, and local regression.
Reading: (From HTF) Sections 2.8.1 and 2.8.2.

• Week 3.  (Slides)
Tuesday class: Model selection: Training error versus generalization error. Parameter inference: Gaussian model for the error, and how $\hat \beta$ is distributed. The Gauss Markov Theorem.
Reading: (From HTF) Sections 2.9, 3.1, and 3.2.2.

Thursday class: The singular value decomposition and least squares. Univariate versus multivariate regression. Shortcomings of least squares and the bias-variance dichotomy. Subset selection. Ridge regression and intro to the Lasso.
Reading: (From HTF) Sections 3.2.3, 3.3.1, and 3.4.1 up to page 64.
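A short sketch of the SVD computation from this lecture, on synthetic data (the variable names and problem sizes are illustrative): least squares in the SVD basis, and how ridge regression shrinks each singular direction by the factor $s_j^2/(s_j^2+\lambda)$ (cf. the SVD discussion in HTF Section 3.4.1).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=40)

# Least squares via the thin SVD, X = U S V^T:
#   beta_hat = V S^{-1} U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_ls = Vt.T @ ((U.T @ y) / s)

# Ridge regression in the same basis: each coordinate of U^T y is
# multiplied by s_j / (s_j^2 + lambda) instead of 1 / s_j,
# i.e. shrunk by the factor s_j^2 / (s_j^2 + lambda) < 1.
lam = 1.0
beta_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
```

Directions with small singular values (where the data carry little information) are shrunk the most, which is the bias-variance trade made by ridge.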

• Week 4.  (Slides)
Tuesday class: A rapid course on convex functions (the Legendre transform, the subdifferential, critical points and the subdifferential). The Lasso and characterization of minimizers through the subdifferential. Example: the Lasso for an orthogonal input matrix and the shrinkage operator.
Reading: (From HTF) Rest of Section 3.4.1, Section 3.4.2.
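The shrinkage operator from this lecture is small enough to write out in full (the function name `soft_threshold` is my label for it):

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrinkage (soft-thresholding) operator:
    S_lam(z) = sign(z) * max(|z| - lam, 0), applied coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

For an orthonormal input matrix, the subdifferential characterization of Lasso minimizers reduces to applying `soft_threshold` to each least squares coefficient: coefficients smaller than $\lambda$ in magnitude are set exactly to zero, which is the source of sparsity.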

Thursday class: Discussion of the first problem set (distributions, convolutions, approximations to the identity). Lasso as a constrained quadratic program. The geometry of $\ell^1$ and how it explains the sparsity of Lasso solutions.
Reading: (From HTF) Section 3.4.3 and Section 3.6.

• Week 5-6.  (No class)

• Week 7.  (Slides)
Tuesday class: Using the Lasso with least squares to minimize bias. Inference when $p \gg N$ via the Dantzig selector. The Restricted Isometry Property and a theorem of Candès and Tao. Summary of linear regression. Basics of linear classification: regression via indicator matrices, masking.
Reading: (From HTF) Sections 3.8.3 and 3.8.4, Sections 4.1 and 4.2.

Thursday class: Linear and quadratic discriminant analysis with Gaussian mixtures. Logistic regression.
Reading: (From HTF) Section 4.3 (up to page 112), Section 4.4.1.

• Week 8.  (Slides)
Tuesday class: More on Logistic Regression and gradient descent. Logistic regression versus LDA. Rosenblatt's separating hyperplane algorithm and Vapnik's optimality criterion.
Reading: (From HTF) Sections 4.4.1, 4.4.5, and 4.5.
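A minimal sketch of logistic regression fit by plain gradient descent, as discussed in this lecture (the data, step size, and function names are illustrative; HTF's own treatment uses Newton-Raphson/IRLS):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Minimize the negative log-likelihood by gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        # Gradient of the (averaged) negative log-likelihood.
        grad = X.T @ (sigmoid(X @ beta) - y) / n
        beta -= lr * grad
    return beta

# Toy one-dimensional data with an intercept column:
# class 0 centered at -1, class 1 at +1.
rng = np.random.default_rng(2)
X = np.hstack([np.ones((60, 1)),
               np.vstack([rng.normal(-1.0, 1.0, (30, 1)),
                          rng.normal(1.0, 1.0, (30, 1))])])
y = np.concatenate([np.zeros(30), np.ones(30)])
beta_hat = fit_logistic(X, y)
```

Unlike LDA, nothing here models the inputs themselves; only the conditional distribution of the labels enters the likelihood, which is the comparison made in class.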

Thursday class: A few thoughts on elliptic PDE and the Calculus of Variations. The Laplacian in $\mathbb{R}^d$ and its many versions. Fourier series, eigenfunctions of the second derivative, and the heat equation.
Reading: (From HTF) Sections 7.1, 7.2, 7.4 and 7.5.

• Week 9.  (Slides)
Tuesday class: More on the calculus of variations: the Euler-Lagrange equation and Dirichlet's principle for harmonic functions. Historical detour: Hilbert's 19th problem. The smoothing effect of the heat equation. The mean value property for harmonic functions.
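The basic computation behind Dirichlet's principle, in the notation used in class:

```latex
E(u) = \frac{1}{2}\int_\Omega |\nabla u|^2 \, dx,
\qquad
\frac{d}{dt}\Big|_{t=0} E(u + t\varphi)
  = \int_\Omega \nabla u \cdot \nabla \varphi \, dx
  = -\int_\Omega (\Delta u)\,\varphi \, dx
```

for every $\varphi \in C_c^\infty(\Omega)$, so a minimizer of the Dirichlet energy (with prescribed boundary values) satisfies the Euler-Lagrange equation $\Delta u = 0$, i.e. is harmonic.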

Thursday class: The strong maximum principle. The comparison principle for Laplace's equation and its consequences. The fractional Laplacian.

• Week 10.  (Slides)
Tuesday class: Infinite differentiability of harmonic functions and interior derivative estimates. Graphs, their weights and graph Laplacians.

Thursday class: The strong maximum principle in a graph. Measuring smoothness for functions in a graph. Spectral properties of the combinatorial Laplacian, and its eigenfunction decomposition. The heat kernel in a graph.
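A small numerical companion to the graph lectures: the combinatorial Laplacian of a path on 4 nodes, its eigenfunction decomposition, and the heat kernel $e^{-tL}$ built from it (the graph is an arbitrary illustrative example):

```python
import numpy as np

# Weight matrix of a path graph on 4 nodes; L = D - W.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W

# L is symmetric positive semidefinite: its smallest eigenvalue is 0,
# with the constant eigenvector, and its quadratic form measures smoothness:
#   f^T L f = (1/2) * sum_{i,j} w_ij (f_i - f_j)^2.
evals, evecs = np.linalg.eigh(L)

# Heat kernel on the graph, e^{-tL}, from the eigendecomposition.
t = 1.0
heat = evecs @ np.diag(np.exp(-t * evals)) @ evecs.T
```

Since the constant vector is in the kernel of $L$, the heat kernel preserves constants, the discrete analogue of conservation under the heat equation.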

• Week 11.  (Slides)
Tuesday class: The Dirichlet problem on a graph. Semi-supervised learning and harmonic minimization.
Thursday class: More clustering: dissimilarity matrices, the $K$-means algorithm and an example in vector quantization. A comparison of various "metrics" over distributions: $L^p$, total variation, the Earth Mover's Distance, the Kullback-Leibler divergence.
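A minimal sketch of the $K$-means (Lloyd's) algorithm, alternating the assignment and centroid-update steps (the function name and initialization scheme are illustrative choices, not a prescribed implementation):

```python
import numpy as np

def kmeans(X, K, steps=50, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-center assignment
    and centroid recomputation until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # Initialize centers at K distinct data points.
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(steps):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its points.
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Each step decreases the within-cluster sum of squares, so the algorithm converges, though only to a local minimum; the result depends on the initialization.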