About me
I am currently an IFDS postdoctoral scholar at the University of Washington, working with Dmitriy Drusvyatskiy and Maryam Fazel. I received my PhD in Computer Science from UCSD, where I was fortunate to be advised by Mikhail Belkin. Prior to UCSD, I received my BS in Mathematics from Zhejiang University.
I am broadly interested in the optimization and mathematical foundations of deep learning. I have worked on understanding the dynamics of wide neural networks, particularly in the NTK regime [Jacot et al. 2018] and the catapult regime [Lewkowycz et al. 2020]. More recently, I have focused on feature learning in machine learning models, particularly in neural networks and kernel machines.
🚀 I am currently on the job market for full‑time positions.
Email: libinzhu at uw dot edu
Selected Papers
- Iteratively reweighted kernel machines efficiently learn sparse functions
Zhu L., Davis D., Drusvyatskiy D. & Fazel M. Preprint.
Abstract:
The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, we argue that these two phenomena are not unique to neural networks, and can be elicited from classical kernel methods. Namely, we show that the derivative of the kernel predictor can detect the influential coordinates with low sample complexity. Moreover, by iteratively using the derivatives to reweight the data and retrain kernel machines, one is able to efficiently learn hierarchical polynomials with finite leap complexity. Numerical experiments illustrate the developed theory.
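A minimal numpy sketch of how such a reweighting loop can look, assuming kernel ridge regression with a Gaussian kernel and derivative-based coordinate weights; the bandwidth, ridge parameter, and the specific weight update below are illustrative choices, not the paper's code:

```python
import numpy as np

S, REG = 3.0, 1e-3                         # bandwidth and ridge (illustrative)

def gauss_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * S**2))

def fit_and_input_grads(X, y, w):
    """Kernel ridge fit on coordinate-reweighted data; predictor gradients at the data."""
    Xw = X * w
    K = gauss_kernel(Xw, Xw)
    alpha = np.linalg.solve(K + REG * np.eye(len(X)), y)
    diff = (Xw[None, :, :] - Xw[:, None, :]) / S**2    # d(exponent)/d(input)
    grads = np.einsum("ij,ijk,j->ik", K, diff, alpha)  # (n, d), in reweighted coords
    return grads * w                                   # chain rule back to raw coords

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X[:, 0] * X[:, 1]                      # sparse target: only coords 0 and 1 matter
w = np.ones(10)
for it in range(5):
    G = fit_and_input_grads(X, y, w)
    w = np.sqrt((G**2).mean(0))            # derivative-based coordinate importance
    w /= w.max()
    print(f"iter {it}: w ≈ {np.round(w, 2)}")
```

Coordinates whose derivatives stay small get down-weighted, so the kernel progressively concentrates on the influential ones.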
- Emergence in non‑neural models: grokking modular arithmetic via average gradient outer product
Mallinar N., Beaglehole D., Zhu L., Radhakrishnan A., Pandit P. & Belkin M. ICML 2025.
Abstract:
Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures nor gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.
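For reference, the AGOP that drives RFM, and the Mahalanobis kernel it is fed back into, can be written as follows (notation mine; a Gaussian form is shown for concreteness, while RFM as published uses a Laplace-type kernel):

```latex
M \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla f(x_i)\,\nabla f(x_i)^{\top},
\qquad
k_M(x,z) \;=\; \exp\!\left(-\gamma\,(x-z)^{\top} M\,(x-z)\right).
```

Each RFM iteration fits a kernel machine with $k_M$, recomputes $M$ from the resulting predictor, and repeats.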
- Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning
Zhu L., Liu C., Radhakrishnan A. & Belkin M. ICML 2024.
Abstract:
In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.
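A small sketch of the alignment measurement this abstract refers to, assuming gradient oracles for the model and the true predictor are available; the `true_grad` and `noisy_grad` functions below are hypothetical stand-ins:

```python
import numpy as np

def agop(grad_f, X):
    """Average Gradient Outer Product: mean of grad f(x) grad f(x)^T over X."""
    G = np.stack([grad_f(x) for x in X])          # (n, d)
    return G.T @ G / len(X)                        # (d, d)

def cosine_alignment(A, B):
    """Cosine similarity between two matrices, viewed as flattened vectors."""
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
true_grad = lambda x: np.array([x[1], x[0], 0, 0, 0, 0])   # f*(x) = x0 * x1
noisy_grad = lambda x: true_grad(x) + 0.5 * rng.standard_normal(6)
print(cosine_alignment(agop(noisy_grad, X), agop(true_grad, X)))
```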
- Quadratic models for understanding catapult dynamics of neural networks
Zhu L., Liu C., Radhakrishnan A. & Belkin M. ICLR 2024.
Abstract:
While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. In this work we show that recently proposed Neural Quadratic Models can exhibit the "catapult phase" [Lewkowycz et al. 2020] that arises when training such models with large learning rates. We then empirically show that the behaviour of neural quadratic models parallels that of neural networks in generalization, especially in the catapult phase regime. Our analysis further demonstrates that quadratic models can be an effective tool for analysis of neural networks.
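A self-contained numpy illustration of the catapult in the simplest quadratic model, a two-layer linear network with a single training point; this is a toy in the spirit of the paper, not its code, and the learning-rate window $(2/\lambda, 4/\lambda)$ is the one identified in the catapult literature:

```python
import numpy as np

rng = np.random.default_rng(0)
m, y = 512, 2.0                            # width and scalar target
w, v = rng.standard_normal(m), rng.standard_normal(m)
sqrt_m = np.sqrt(m)

lam = (w @ w + v @ v) / m                  # scalar tangent kernel of f = v.w / sqrt(m)
lr = 3.0 / lam                             # inside (2/lam, 4/lam): catapult regime
for step in range(40):
    g = v @ w / sqrt_m - y                 # residual
    lam = (w @ w + v @ v) / m
    if step % 4 == 0:
        print(f"step {step:2d}  loss {0.5 * g * g:12.4g}  kernel {lam:6.3f}")
    w, v = w - lr * g * v / sqrt_m, v - lr * g * w / sqrt_m
# Expected behavior: the loss first grows (a "spike"), the kernel shrinks,
# and once lr < 2 / kernel the loss decreases again -- the catapult.
```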
- Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture
Liu C., Zhu L. & Belkin M. NeurIPS 2022.
Abstract:
In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum in-degree of their neurons, except for the input and first layers. Our results identify the mathematical structure underlying transition to linearity and generalize a number of recent works aimed at characterizing transition to linearity or constancy of the Neural Tangent Kernel for standard architectures.
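A quick numerical check of the phenomenon for the simplest special case of the DAG setting, a two-layer ReLU network: the tangent-kernel entry between two fixed inputs moves less and less under a unit-norm parameter perturbation as the width m grows. This is an illustrative sketch, not the paper's experiment:

```python
import numpy as np

def ntk_entry(W, v, x1, x2, m):
    """Tangent-kernel entry of f(x) = v . relu(W x) / sqrt(m) at inputs x1, x2."""
    a1, a2 = W @ x1, W @ x2
    k_v = np.maximum(a1, 0) @ np.maximum(a2, 0) / m            # grads wrt v
    k_W = (v**2 * (a1 > 0) * (a2 > 0)).sum() * (x1 @ x2) / m   # grads wrt W
    return k_v + k_W

rng = np.random.default_rng(0)
d = 5
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
for m in [64, 256, 1024, 4096]:
    W, v = rng.standard_normal((m, d)), rng.standard_normal(m)
    dW = rng.standard_normal((m, d)); dW /= np.linalg.norm(dW)  # O(1) parameter move
    dv = rng.standard_normal(m); dv /= np.linalg.norm(dv)
    k0 = ntk_entry(W, v, x1, x2, m)
    k1 = ntk_entry(W + dW, v + dv, x1, x2, m)
    print(f"m = {m:5d}   relative kernel change = {abs(k1 - k0) / abs(k0):.4f}")
```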
- On the linearity of large non‑linear models: when and why the tangent kernel is constant
Liu C., Zhu L. & Belkin M. NeurIPS 2020 (Spotlight).
Abstract:
The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian scaling applicable to the standard classes of neural networks. Our analysis provides a new perspective on the phenomenon of constant tangent kernel, which is different from the widely accepted "lazy training". Furthermore, we show that the transition to linearity is not a general property of wide neural networks and does not hold when the last layer of the network is non-linear. It is also not necessary for successful optimization by gradient descent.
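The mechanism in one line (my paraphrase of the abstract): for a network of width $m$,

```latex
\|\nabla_w f(w) - \nabla_w f(w_0)\|
 \;\le\; \sup_{w'} \|H_f(w')\|\,\|w - w_0\|,
\qquad
\|H_f\| = \tilde{O}\!\left(m^{-1/2}\right),
```

so along a training path of length $O(1)$ the gradient map, and with it the tangent kernel $K(x, x') = \langle \nabla_w f(x), \nabla_w f(x') \rangle$, changes by only $\tilde{O}(m^{-1/2})$ and becomes constant in the infinite-width limit.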
- Loss landscapes and optimization in over‑parameterized non‑linear systems and neural networks
Liu C., Zhu L. & Belkin M. Applied and Computational Harmonic Analysis (ACHA) 2022.
Abstract:
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization problems corresponding to such systems are generally not convex, even locally. We argue that instead they satisfy PL∗, a variant of the Polyak-Łojasiewicz condition on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL∗ condition of these systems is closely related to the condition number of the tangent kernel associated to a non-linear system showing how a PL∗-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL∗ condition, which explains the (S)GD convergence to a global minimum. Finally we propose a relaxation of the PL∗ condition applicable to "almost" over-parameterized systems.
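The link between PL∗ and the tangent kernel is a two-line computation: for the square loss on a system $F(w) = y$,

```latex
\mathcal{L}(w) = \tfrac{1}{2}\|F(w) - y\|^2,
\qquad
\tfrac{1}{2}\|\nabla \mathcal{L}(w)\|^2
 = \tfrac{1}{2}\,(F(w) - y)^{\top} K(w)\,(F(w) - y)
 \;\ge\; \lambda_{\min}\big(K(w)\big)\,\mathcal{L}(w),
```

where $K(w) = DF(w)\,DF(w)^{\top}$ is the tangent kernel of the system. Hence PL∗ holds with $\mu = \lambda_{\min}(K)$ wherever the kernel is uniformly positive definite, which, per the abstract, wide networks guarantee on most of the parameter space.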