Qinxin Yan, Princeton University
It is widely observed that overparameterized neural networks—often with more parameters than training samples—can interpolate the training data while still generalizing well. One theoretical approach to this phenomenon studies the dynamics of gradient-based training. Empirically, and theoretically in several settings, gradient descent is seen to converge to particular “simpler” solutions among the many minimizers, a bias commonly referred to as implicit regularization. Early stopping during training can further reduce effective model complexity and often improves generalization.
In this talk, we adopt the mean-field formulation for wide neural networks, representing the network by a probability measure over parameters and viewing training as a gradient flow on Wasserstein space. Building on this viewpoint, we introduce a mean-field control formulation of the training dynamics. This control perspective, together with the dynamic programming principle, leads to a mean-field analogue of the Wasserstein-2 distance and provides a framework for analyzing early stopping and implicit regularization.
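For readers unfamiliar with the mean-field viewpoint, a standard formulation (notation ours, not necessarily the speaker's) represents a wide two-layer network as an integral over a parameter measure, so that training by gradient flow on the risk becomes a Wasserstein-2 gradient flow on the space of measures:

```latex
% A wide two-layer network as an expectation over a parameter measure \mu:
%   f(x;\mu) = \int \sigma(x;\theta)\, \mu(d\theta)
% Gradient-flow training of the risk R(\mu) corresponds, in the mean-field
% limit, to the Wasserstein-2 gradient flow (a continuity equation):
\partial_t \mu_t
  = \nabla_\theta \cdot
    \Big( \mu_t \, \nabla_\theta \frac{\delta R}{\delta \mu}(\mu_t) \Big),
% where \delta R / \delta \mu denotes the first variation of the risk
% functional. Early stopping corresponds to halting this flow at a finite
% time t, which the control formulation in the talk makes precise.
```

The control formulation mentioned in the abstract augments this flow with a controlled drift and optimizes over stopping time, which is where the dynamic programming principle enters.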